py_worksheet_wrangling (Score: 76.0 / 76.0)

  1. Test cell (Score: 3.0 / 3.0)
  2. Test cell (Score: 3.0 / 3.0)
  3. Test cell (Score: 3.0 / 3.0)
  4. Test cell (Score: 3.0 / 3.0)
  5. Test cell (Score: 3.0 / 3.0)
  6. Test cell (Score: 3.0 / 3.0)
  7. Test cell (Score: 3.0 / 3.0)
  8. Test cell (Score: 1.0 / 1.0)
  9. Test cell (Score: 3.0 / 3.0)
  10. Test cell (Score: 1.0 / 1.0)
  11. Test cell (Score: 3.0 / 3.0)
  12. Test cell (Score: 3.0 / 3.0)
  13. Test cell (Score: 1.0 / 1.0)
  14. Test cell (Score: 3.0 / 3.0)
  15. Test cell (Score: 3.0 / 3.0)
  16. Test cell (Score: 3.0 / 3.0)
  17. Test cell (Score: 3.0 / 3.0)
  18. Test cell (Score: 3.0 / 3.0)
  19. Test cell (Score: 1.0 / 1.0)
  20. Test cell (Score: 1.0 / 1.0)
  21. Test cell (Score: 3.0 / 3.0)
  22. Test cell (Score: 3.0 / 3.0)
  23. Test cell (Score: 1.0 / 1.0)
  24. Test cell (Score: 3.0 / 3.0)
  25. Test cell (Score: 1.0 / 1.0)
  26. Test cell (Score: 3.0 / 3.0)
  27. Test cell (Score: 3.0 / 3.0)
  28. Test cell (Score: 3.0 / 3.0)
  29. Test cell (Score: 3.0 / 3.0)
  30. Test cell (Score: 3.0 / 3.0)

Worksheet: Cleaning and wrangling data

This worksheet covers the Cleaning and wrangling data chapter of the online textbook, which also lists the learning objectives for this worksheet. You should read the textbook chapter before attempting this worksheet.

In [1]:
### Run this cell before continuing.
import altair as alt
import pandas as pd

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')
Out[1]:
DataTransformerRegistry.enable('vegafusion')

Question 0.0 Multiple Choice:
{points: 1}

Which of the following characterize a tidy dataset? Note: there may be more than one correct answer to this question.

A) Each row is a single variable

B) There are no missing or erroneous values

C) Each value is a single cell

D) Each variable is a single column

Assign your answer to an object called answer0_0 in the code chunk below. Make sure your answer uses uppercase letters, and surround each letter with quotation marks inside square brackets. If there is more than one answer to this question, separate each letter with a comma within the square brackets. For example, if you believe the answer is A, B and C, your answer would look like this: answer0_0 = ['A', 'B', 'C']

In [2]:
Student's answer(Top)
answer0_0 = ['C', 'D']
In [3]:
Grade cell: cell-0bea903e0f37148c Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer0_0)).encode("utf-8")+b"dcefd").hexdigest() == "d2b356229360b49866bacc1b93ce284fa1b92f4a", "type of answer0_0 is not list. answer0_0 should be a list"
assert sha1(str(len(answer0_0)).encode("utf-8")+b"dcefd").hexdigest() == "b308e7b8cf3ead94f053e4a6638b2435dff73943", "length of answer0_0 is not correct"
assert sha1(str(sorted(map(str, answer0_0))).encode("utf-8")+b"dcefd").hexdigest() == "5aa68520ec860e9e5a9a1e74c74d6deaec742c1a", "values of answer0_0 are not correct"
assert sha1(str(answer0_0).encode("utf-8")+b"dcefd").hexdigest() == "5aa68520ec860e9e5a9a1e74c74d6deaec742c1a", "order of elements of answer0_0 is not correct"

print('Success!')
Success!
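
The grade cell above does not store the expected answer in plain text. It compares salted SHA-1 digests of the string form of the answer. The following is a minimal sketch of that pattern, not the grader's actual configuration: the salt b"demo", the expected value ['X', 'Y'], and the helper names digest and values_match are all made up for illustration.

```python
from hashlib import sha1


def digest(value, salt: bytes) -> str:
    """Return the hex SHA-1 digest of str(value) followed by a salt."""
    return sha1(str(value).encode("utf-8") + salt).hexdigest()


SALT = b"demo"  # hypothetical salt, not one used by this worksheet

# A grader would ship only this digest, never the plain-text answer.
expected_digest = digest(sorted(map(str, ["X", "Y"])), SALT)


def values_match(answer) -> bool:
    """Order-insensitive check: sort the answer before hashing it."""
    return digest(sorted(map(str, answer)), SALT) == expected_digest


print(values_match(["Y", "X"]))  # same values in a different order
print(values_match(["X"]))       # a different set of values
```

Sorting before hashing is what lets the "values" assertion pass regardless of order, while the separate "order" assertion hashes the list exactly as written.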

Question 0.1 Multiple Choice:
{points: 1}

The data below are wine ratings given to 3 wines by 5 different wine tasters. We are interested in seeing if Taster or Wine type influences the rating. Given that motivation, which arrangement of the data set shown below is "tidy"?

Data set 1:
Taster  Chardonnay  Pinot Grigio  Pinot Blanc
001     75          89            92
002     89          88            89
003     72          90            95
004     85          81            90
005     83          89            88
Data set 2:
Wine          Taster 001  Taster 002  Taster 003  Taster 004  Taster 005
Chardonnay    75          89          72          85          83
Pinot Grigio  89          88          90          81          89
Pinot Blanc   92          89          95          90          88
Data set 3:
Taster  Wine          Rating
001     Chardonnay    75
002     Chardonnay    89
003     Chardonnay    72
004     Chardonnay    85
005     Chardonnay    83
001     Pinot Grigio  89
002     Pinot Grigio  88
003     Pinot Grigio  90
004     Pinot Grigio  81
005     Pinot Grigio  90
001     Pinot Blanc   92
002     Pinot Blanc   89
003     Pinot Blanc   95
004     Pinot Blanc   90
005     Pinot Blanc   88
Data set 4:
Taster  Chardonnay Rating
001     75
002     89
003     72
004     85
005     83

Taster  Pinot Grigio Rating
001     89
002     88
003     90
004     81
005     90

Taster  Pinot Blanc Rating
001     92
002     89
003     95
004     90
005     88

Assign your answer to an object called answer0_1. Make sure your answer is surrounded by square brackets. If there is more than one answer to this question, separate each number with a comma in the square brackets. For example, if you believe the answer is 1, 2 and 3, your answer would look like this: answer0_1 = [1, 2, 3]

In [4]:
Student's answer(Top)
answer0_1 = [3]
In [5]:
Grade cell: cell-7bc5804c8cd5900d Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer0_1)).encode("utf-8")+b"82a97").hexdigest() == "06675e978588f0ff1916949852fce5918250da38", "type of answer0_1 is not list. answer0_1 should be a list"
assert sha1(str(len(answer0_1)).encode("utf-8")+b"82a97").hexdigest() == "02e678e6143abd600cc82e54ead4bfcce627447a", "length of answer0_1 is not correct"
assert sha1(str(sorted(map(str, answer0_1))).encode("utf-8")+b"82a97").hexdigest() == "fdad14c66b03a268975522797efe97fe6b348d92", "values of answer0_1 are not correct"
assert sha1(str(answer0_1).encode("utf-8")+b"82a97").hexdigest() == "66f40af377bdefe94dac626404fad6dead19c219", "order of elements of answer0_1 is not correct"

print('Success!')
Success!
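
As an illustration of how the layouts in this question relate to each other, the wide table of Data set 1 can be reshaped into the long layout of Data set 3 with the pandas melt method. This is only a sketch: the values are copied from Data set 1 above, and the variable names wide and long_form are chosen here for illustration.

```python
import pandas as pd

# Data set 1 from the question: one column per wine (wide layout).
wide = pd.DataFrame({
    "Taster": ["001", "002", "003", "004", "005"],
    "Chardonnay": [75, 89, 72, 85, 83],
    "Pinot Grigio": [89, 88, 90, 81, 89],
    "Pinot Blanc": [92, 89, 95, 90, 88],
})

# melt stacks the three wine columns into (Wine, Rating) pairs,
# giving one row per taster-wine observation.
long_form = wide.melt(id_vars="Taster", var_name="Wine", value_name="Rating")

print(long_form.shape)
print(long_form.head())
```

In the long layout, Taster, Wine and Rating each occupy exactly one column, so either Taster or Wine can be used directly as a grouping or plotting variable.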

Question 0.2 Multiple Choice:
{points: 1}

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below:

Why is the primary goal of data wrangling getting dataframes into the tidy data format?

A) Having data expressed in such a way allows for easier readability and is more aesthetically pleasing.

B) Tidy format uses less storage space on your computer.

C) Many or most modern Data Science tools accept the tidy data format directly (or very close to that) and we need to get the data in a state ready for analysis.

Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks ("") and assign it to an object called answer0_2.

In [6]:
Student's answer(Top)
answer0_2 = "C"
In [7]:
Grade cell: cell-385b06349b0b7c8f Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer0_2)).encode("utf-8")+b"9d38").hexdigest() == "d650cc3d6f7a40af446f117fb9a48c5e455107f4", "type of answer0_2 is not str. answer0_2 should be an str"
assert sha1(str(len(answer0_2)).encode("utf-8")+b"9d38").hexdigest() == "e6a09c0eca2a548795c34b08e168b5e179250461", "length of answer0_2 is not correct"
assert sha1(str(answer0_2.lower()).encode("utf-8")+b"9d38").hexdigest() == "ea78c9bf57f89198ceb1e280121135f2b68867b4", "value of answer0_2 is not correct"
assert sha1(str(answer0_2).encode("utf-8")+b"9d38").hexdigest() == "c5e8541048d7fa22d0939197bb884c84f8fa06b1", "correct string value of answer0_2 but incorrect case of letters"

print('Success!')
Success!

Question 0.3 Multiple Choice:
{points: 1}

For which scenario would using groupby + mean be appropriate?

A. To apply the same function to every row.

B. To apply the same function to every column.

C. To apply the same function to groups of rows.

D. To apply the same function to groups of columns.

Assign your answer to an object called answer0_3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [8]:
Student's answer(Top)
answer0_3 = "C"
In [9]:
Grade cell: cell-386ada4b41ae9cae Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer0_3)).encode("utf-8")+b"f2e56").hexdigest() == "e0e3861cc1a9c49bc790d0703a0dd263658dc787", "type of answer0_3 is not str. answer0_3 should be an str"
assert sha1(str(len(answer0_3)).encode("utf-8")+b"f2e56").hexdigest() == "8b78fbb97c9ee09546aa0db2de1f4263070d5862", "length of answer0_3 is not correct"
assert sha1(str(answer0_3.lower()).encode("utf-8")+b"f2e56").hexdigest() == "f081aa9c76471c92b14419d00ca58b6e3ca80a2d", "value of answer0_3 is not correct"
assert sha1(str(answer0_3).encode("utf-8")+b"f2e56").hexdigest() == "0cdd5d54853cd7d244c801f01d21cfeb8c0efa33", "correct string value of answer0_3 but incorrect case of letters"

print('Success!')
Success!
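
To make the groupby + mean pattern from this question concrete, here is a small sketch on made-up data. The data frame prices and its values are hypothetical and are not taken from the avocado file.

```python
import pandas as pd

# Hypothetical toy data: four rows that fall into two groups.
prices = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "price": [1.0, 1.2, 1.6, 2.0],
})

# Rows sharing the same "type" form one group; mean is applied to each group.
mean_by_type = prices.groupby("type")["price"].mean()

print(mean_by_type)
```

The result has one entry per group rather than one per row, which is what makes this combination useful for summarizing data.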

1. Assessing avocado prices to inform restaurant menu planning

It is well known that millennials LOVE avocado toast (joking...well mostly 😉), and so many restaurants will offer menu items that centre around this delicious food! Like many food items, avocado prices fluctuate. So a restaurant that wants to maximize profits on avocado-containing dishes might ask if there are times when avocados are less expensive to purchase. If such times exist, this is when the restaurant should put avocado-containing dishes on the menu to maximize its profits for those dishes.

[Image: avocado toast]

Source: https://www.averiecooks.com/egg-hole-avocado-toast/

To answer this question we will analyze a data set of avocado sales from multiple US markets. This data was downloaded from the Hass Avocado Board website in May of 2018 and compiled into a single CSV file. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015-2018.

Some relevant columns in the dataset:

  • Date - The date in year-month-day format
  • average_price - The average price of a single avocado
  • type - conventional or organic
  • yr - The year
  • region - The city or region of the observation
  • small_hass_volume in pounds (lbs)
  • large_hass_volume in pounds (lbs)
  • extra_l_hass_volume in pounds (lbs)
  • wk - integer number for the calendar week in the year (e.g., first week of January is 1, and last week of December is 52).
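
The worksheet does not say how the wk column was computed. As a hedged sketch, one way to derive a week-of-year integer from a date string is ISO week numbering through pandas. The assumption that wk follows the ISO convention is ours; it agrees with the Date and wk pairs used below, which are copied from the data preview later in this worksheet.

```python
import pandas as pd

# Three Date values that appear in the worksheet's data preview.
dates = pd.to_datetime(pd.Series(["2015-12-27", "2015-12-20", "2018-01-07"]))

# isocalendar() returns year, week and day columns; keep only the ISO week.
weeks = dates.dt.isocalendar().week.astype(int).tolist()

print(weeks)  # [52, 51, 1]
```

Under ISO numbering some years contain a week 53, so a week column built this way can take 53 distinct values.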

To answer our question of whether there are times in the year when avocados are typically less expensive (and thus we can make more profitable menu items with them at a restaurant) we will want to create a scatter plot of average_price (y-axis) versus Date (x-axis).

Question 1.1 Multiple Choice:
{points: 1}

Which of the following is not included in the CSV file?

A. Average price of a single avocado.

B. The farming practice (production with/without the use of chemicals).

C. Average price of a bag of avocados.

D. All options are included in the data set.

Assign your answer to an object called answer1_1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [10]:
Student's answer(Top)
answer1_1 = "C"
In [11]:
Grade cell: cell-1c278f180e20468f Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer1_1)).encode("utf-8")+b"adc95").hexdigest() == "f65cd5d5c9ff679b669dcd967b97b3b8744ef959", "type of answer1_1 is not str. answer1_1 should be an str"
assert sha1(str(len(answer1_1)).encode("utf-8")+b"adc95").hexdigest() == "6d7c58889cee9ce413e8290664d8cd9f3276bee3", "length of answer1_1 is not correct"
assert sha1(str(answer1_1.lower()).encode("utf-8")+b"adc95").hexdigest() == "baf7f8244b08aafba1898998a6166eab4338d5dd", "value of answer1_1 is not correct"
assert sha1(str(answer1_1).encode("utf-8")+b"adc95").hexdigest() == "fde8f6fbcb1a8349286a9f350d89d90b24b75139", "correct string value of answer1_1 but incorrect case of letters"

print('Success!')
Success!

Question 1.2 Multiple Choice:
{points: 1}

The rows in the data frame represent:

A. daily avocado sales data for a region

B. weekly avocado sales data for a region

C. bi-weekly avocado sales data for a region

D. yearly avocado sales data for a region

Assign your answer to an object called answer1_2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [12]:
Student's answer(Top)
answer1_2 = "B"
In [13]:
Grade cell: cell-6adfe52857aa9333 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer1_2)).encode("utf-8")+b"55730").hexdigest() == "a1ec9df8d575f97226f3ef3f3c21bb8498729858", "type of answer1_2 is not str. answer1_2 should be an str"
assert sha1(str(len(answer1_2)).encode("utf-8")+b"55730").hexdigest() == "437c87eb99d4120a72246ae26d1689ae5bcaaf8a", "length of answer1_2 is not correct"
assert sha1(str(answer1_2.lower()).encode("utf-8")+b"55730").hexdigest() == "796d6c42decf08a8c2f0af4383885e1d62896c0e", "value of answer1_2 is not correct"
assert sha1(str(answer1_2).encode("utf-8")+b"55730").hexdigest() == "4a1d8848530a58420062ace1c4374c9b867add62", "correct string value of answer1_2 but incorrect case of letters"

print('Success!')
Success!

Question 1.3
{points: 1}

The first step to plotting average price against date is to read the file avocado_prices.csv using the shortest relative path. The data file was given to you along with this worksheet, but you will have to look in the data directory to find it so that you can load it correctly. You should also preview the file to help you choose an appropriate read_* function to read the data.

Assign your answer to an object called avocado.

In [14]:
Student's answer(Top)
avocado = pd.read_csv("data/avocado_prices.csv")
avocado
Out[14]:
Date average_price small_hass_volume large_hass_volume extra_l_hass_volume type yr region wk
0 2015-12-27 1.33 1036.74 54454.85 48.16 conventional 2015 Albany 52
1 2015-12-20 1.35 674.28 44638.81 58.33 conventional 2015 Albany 51
2 2015-12-13 0.93 794.70 109149.67 130.50 conventional 2015 Albany 50
3 2015-12-06 1.08 1132.00 71976.41 72.58 conventional 2015 Albany 49
4 2015-11-29 1.28 941.48 43838.39 75.78 conventional 2015 Albany 48
... ... ... ... ... ... ... ... ... ...
17906 2018-02-04 1.63 2046.96 1529.20 0.00 organic 2018 WestTexNewMexico 5
17907 2018-01-28 1.71 1191.70 3431.50 0.00 organic 2018 WestTexNewMexico 4
17908 2018-01-21 1.87 1191.92 2452.79 727.94 organic 2018 WestTexNewMexico 3
17909 2018-01-14 1.93 1527.63 2981.04 727.01 organic 2018 WestTexNewMexico 2
17910 2018-01-07 1.62 2894.77 2356.13 224.53 organic 2018 WestTexNewMexico 1

17911 rows × 9 columns

In [15]:
Grade cell: cell-a968fbd8b038ba4b Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(avocado is None)).encode("utf-8")+b"6acd8").hexdigest() == "b2baa38e33d93257c58de862f936f5ba856a9080", "type of avocado is None is not bool. avocado is None should be a bool"
assert sha1(str(avocado is None).encode("utf-8")+b"6acd8").hexdigest() == "a19b19295d4381fe86d0641c5dfdab3c386b3955", "boolean value of avocado is None is not correct"

assert sha1(str(type(avocado)).encode("utf-8")+b"6acd9").hexdigest() == "a48e42718092760e926abc03310667e3a0f00d09", "type of type(avocado) is not correct"

assert sha1(str(type(avocado.shape)).encode("utf-8")+b"6acda").hexdigest() == "ed8665e17dd11a229deec6ee019d803aefd5e0e5", "type of avocado.shape is not tuple. avocado.shape should be a tuple"
assert sha1(str(len(avocado.shape)).encode("utf-8")+b"6acda").hexdigest() == "c33da051c49e2479caed3c55c8d82d5cf251330a", "length of avocado.shape is not correct"
assert sha1(str(sorted(map(str, avocado.shape))).encode("utf-8")+b"6acda").hexdigest() == "557bdee675b6eddd909cb67785792a251e2285a3", "values of avocado.shape are not correct"
assert sha1(str(avocado.shape).encode("utf-8")+b"6acda").hexdigest() == "6011169e0ad9d69a932ba6d9fcd647e24ef44d8f", "order of elements of avocado.shape is not correct"

assert sha1(str(type(avocado.columns.values)).encode("utf-8")+b"6acdb").hexdigest() == "34dcabce5192dd1288fcf3c699eba11ff83e1002", "type of avocado.columns.values is not correct"
assert sha1(str(avocado.columns.values).encode("utf-8")+b"6acdb").hexdigest() == "e5796a5d59b161051367bdb984ebf4ccde8e60ca", "value of avocado.columns.values is not correct"

print('Success!')
Success!
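
The advice to preview a file before choosing a read_* function can be sketched as follows. The example writes its own small file so that it is self-contained; the file name example.csv and its contents are hypothetical stand-ins, not the worksheet's data file.

```python
import os
import tempfile

import pandas as pd

# Write a small hypothetical CSV so the example does not depend on any data file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("Date,average_price,region\n")
    f.write("2015-12-27,1.33,Albany\n")
    f.write("2015-12-20,1.35,Albany\n")

# Preview the first raw lines to see the delimiter and whether a header row exists.
with open(path, encoding="utf-8") as f:
    preview = [next(f).rstrip("\n") for _ in range(2)]
print(preview)

# The preview shows comma-separated values with a header row,
# so read_csv with its default settings is a reasonable choice.
example = pd.read_csv(path)
print(example.shape)
```

If the preview had shown tabs or semicolons between values, the same call could instead be given a different sep argument.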

Question 1.4

{points: 1}

To answer our question, let's now create the scatter plot where we plot average_price on the y-axis versus Date on the x-axis. Fill in the ___ in the cell below.

Assign your answer to an object called avocado_plot. Don't forget to create proper English axis labels.

In [16]:
Student's answer(Top)
avocado_plot = alt.Chart(avocado).mark_point().encode(
    x=alt.X("Date").title("Date"),
    y=alt.Y("average_price").title("Average price")
)

avocado_plot
Out[16]:
In [17]:
Grade cell: cell-89b7338558e28dbc Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(avocado_plot is None)).encode("utf-8")+b"d01c6").hexdigest() == "dde1d26a72d11da52245c9a692fe91343ff30fa3", "type of avocado_plot is None is not bool. avocado_plot is None should be a bool"
assert sha1(str(avocado_plot is None).encode("utf-8")+b"d01c6").hexdigest() == "741b89796593288e81b282ee73242052c7bd2cc8", "boolean value of avocado_plot is None is not correct"

assert sha1(str(type(avocado_plot.encoding.x['shorthand'])).encode("utf-8")+b"d01c7").hexdigest() == "a5e38644b8bdc130ca2fa66ee08ee2b7cd921ff0", "type of avocado_plot.encoding.x['shorthand'] is not str. avocado_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(avocado_plot.encoding.x['shorthand'])).encode("utf-8")+b"d01c7").hexdigest() == "34aea90b474a654b86d48b81091cd4cf18024b2d", "length of avocado_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"d01c7").hexdigest() == "e7b49902c74ff68a7124add9feb362a5aac6838a", "value of avocado_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.x['shorthand']).encode("utf-8")+b"d01c7").hexdigest() == "4ed2e3c53a5acb8a25ddae6d3cbb6dbad71b6545", "correct string value of avocado_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_plot.encoding.y['shorthand'])).encode("utf-8")+b"d01c8").hexdigest() == "75d605640e721e4760e255544c7705c3fe27d91b", "type of avocado_plot.encoding.y['shorthand'] is not str. avocado_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(avocado_plot.encoding.y['shorthand'])).encode("utf-8")+b"d01c8").hexdigest() == "e68d6605bc7815b0e952c2faf6bd885eac988f67", "length of avocado_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"d01c8").hexdigest() == "c63b397e44c2a93fd259986fd7aa5b72c8a102c5", "value of avocado_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.y['shorthand']).encode("utf-8")+b"d01c8").hexdigest() == "c63b397e44c2a93fd259986fd7aa5b72c8a102c5", "correct string value of avocado_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_plot.mark)).encode("utf-8")+b"d01c9").hexdigest() == "851d26470cf88a5a3eed67c15a8ce7e307a4a263", "type of avocado_plot.mark is not str. avocado_plot.mark should be an str"
assert sha1(str(len(avocado_plot.mark)).encode("utf-8")+b"d01c9").hexdigest() == "b8a4a8d391e25418752a1bf4f115c22398a6963a", "length of avocado_plot.mark is not correct"
assert sha1(str(avocado_plot.mark.lower()).encode("utf-8")+b"d01c9").hexdigest() == "0f9a5c9e742f0214e7c8353bbb77b6c95deda51d", "value of avocado_plot.mark is not correct"
assert sha1(str(avocado_plot.mark).encode("utf-8")+b"d01c9").hexdigest() == "0f9a5c9e742f0214e7c8353bbb77b6c95deda51d", "correct string value of avocado_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(avocado_plot.encoding.y['title'], str))).encode("utf-8")+b"d01ca").hexdigest() == "037ccfa0569aa3afa47349d0753960131cdf68aa", "type of isinstance(avocado_plot.encoding.y['title'], str) is not bool. isinstance(avocado_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_plot.encoding.y['title'], str)).encode("utf-8")+b"d01ca").hexdigest() == "3619e528a69a9a2f56554cd5c2bf92ee25695af0", "boolean value of isinstance(avocado_plot.encoding.y['title'], str) is not correct"

assert sha1(str(type(isinstance(avocado_plot.encoding.x['title'], str))).encode("utf-8")+b"d01cb").hexdigest() == "f56b31777fc70ec2630667486cdb16b7adf1f0b6", "type of isinstance(avocado_plot.encoding.x['title'], str) is not bool. isinstance(avocado_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_plot.encoding.x['title'], str)).encode("utf-8")+b"d01cb").hexdigest() == "c6f57a38d4c74f310cc69717a378432d2bd3fc8c", "boolean value of isinstance(avocado_plot.encoding.x['title'], str) is not correct"

print('Success!')
Success!

This is a big plot! You can scroll and maybe see some trends, but really what we see in the plot above is not very informative. Why? Because there is a lot of overplotting (data points sitting on top of other data points). What can we do? One solution is to reduce/aggregate the data in a meaningful way to help answer our question. Remember that we are interested in determining if there are times when avocados are less expensive so that we can recommend when restaurants should put dishes on the menu that contain avocado to maximize their profits for those dishes.

In the data we plotted above, each row is the weekly sales of avocados for a region. Let's use .groupby + .mean to calculate the average price for each week across years and regions. We can then plot that aggregated price against the week and perhaps get a clearer picture.

Question 1.5
{points: 1}

Create a reduced/aggregated version of the avocado data set and name it avocado_aggregate. To do this, group the rows by the wk column with groupby and then use mean to calculate the average price. We pass numeric_only=True to tell pandas that we want the mean of only the numeric columns. Note: after groupby is applied, the aggregated data frame uses the groupby column as its index. Since we would like to use the wk column later in the plot, we apply reset_index to turn it back into a regular column.

Assign your answer to an object called avocado_aggregate.

In [18]:
Student's answer(Top)
avocado_aggregate = avocado.groupby("wk").mean(numeric_only=True).reset_index()

avocado_aggregate.head()
Out[18]:
wk average_price small_hass_volume large_hass_volume extra_l_hass_volume yr
0 1 1.286887 191342.390307 218561.677642 14571.720660 2016.5
1 2 1.330519 180736.856368 205410.353231 13433.732241 2016.5
2 3 1.341415 186900.340259 210345.864906 14890.573939 2016.5
3 4 1.315047 183383.858160 206762.362052 14990.596557 2016.5
4 5 1.253608 250123.511250 258856.288561 19160.686274 2016.5
In [19]:
Grade cell: cell-81ec3e479caeb7d7 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert str(type(avocado_aggregate is None)) == "<class 'bool'>", "type of avocado_aggregate is None is not bool. avocado_aggregate is None should be a bool"
assert str(avocado_aggregate is None) == "False", "boolean value of avocado_aggregate is None is not correct"

assert str(type(avocado_aggregate.shape)) == "<class 'tuple'>", "type of avocado_aggregate.shape is not tuple. avocado_aggregate.shape should be a tuple"
assert str(len(avocado_aggregate.shape)) == "2", "length of avocado_aggregate.shape is not correct"
assert str(sorted(map(str, avocado_aggregate.shape))) == "['53', '6']", "values of avocado_aggregate.shape are not correct"
assert str(avocado_aggregate.shape) == "(53, 6)", "order of elements of avocado_aggregate.shape is not correct"

assert sha1(str(type(sum(avocado_aggregate.wk))).encode("utf-8")+b"340d8").hexdigest() == "fa5101c2ce9c6bcd9caa24873dc498b4c845271c", "type of sum(avocado_aggregate.wk) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(avocado_aggregate.wk)).encode("utf-8")+b"340d8").hexdigest() == "c3e744cd11d58fd27bae1e0dffa0b99e26cd68f2", "value of sum(avocado_aggregate.wk) is not correct"

assert sha1(str(type(sum(avocado_aggregate.average_price))).encode("utf-8")+b"340d9").hexdigest() == "a3a4fb7c38647c3509434b1746ed23d846e8a985", "type of sum(avocado_aggregate.average_price) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(avocado_aggregate.average_price), 2)).encode("utf-8")+b"340d9").hexdigest() == "7ca15cadaada34ff44fd491bb9f2382c582e59a7", "value of sum(avocado_aggregate.average_price) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!
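
The note about the index in this question can be seen on a small made-up data frame. This is a sketch only: the toy data are hypothetical, and the as_index=False variant is shown as an alternative that pandas offers, not as the approach the worksheet asks for.

```python
import pandas as pd

# Hypothetical toy data with two weeks and two rows per week.
toy = pd.DataFrame({"wk": [1, 1, 2, 2], "average_price": [1.0, 1.5, 2.0, 3.0]})

# After groupby + mean, wk is the index and no longer a regular column.
grouped = toy.groupby("wk").mean(numeric_only=True)
print(grouped.index.name)
print(list(grouped.columns))

# reset_index moves wk back into the columns so it can be used in a plot.
flat = grouped.reset_index()
print(list(flat.columns))

# Alternative: ask groupby not to use the group key as the index at all.
same = toy.groupby("wk", as_index=False).mean(numeric_only=True)
print(list(same.columns))
```

Either route ends with wk available as an ordinary column, which is what a chart encoding such as x="wk" needs.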

Question 1.6
{points: 1}

Now let's take the avocado_aggregate data frame and use it to create a scatter plot where we plot average_price on the y-axis versus wk on the x-axis.

Assign your answer to an object called avocado_aggregate_plot. Don't forget to create proper English axis titles.

In [20]:
Student's answer(Top)
avocado_aggregate_plot = alt.Chart(avocado_aggregate).mark_point().encode(
    x=alt.X("wk").title("Week"),
    y=alt.Y("average_price")
        .title("Average Price")
        .scale(zero=False)
)

avocado_aggregate_plot
Out[20]:
In [21]:
Grade cell: cell-d70b07b4c2dc0202 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(avocado_aggregate_plot is None)).encode("utf-8")+b"ee7b1").hexdigest() == "68bc35a4d9ab511f1dd9e6d8d6a03a206a1bc070", "type of avocado_aggregate_plot is None is not bool. avocado_aggregate_plot is None should be a bool"
assert sha1(str(avocado_aggregate_plot is None).encode("utf-8")+b"ee7b1").hexdigest() == "9f01ad8cc1df86e054786832b41236c66d815d01", "boolean value of avocado_aggregate_plot is None is not correct"

assert sha1(str(type(avocado_aggregate_plot.encoding.x['shorthand'])).encode("utf-8")+b"ee7b2").hexdigest() == "5521da4cf37f64e0d17d95700570ff2f737224e2", "type of avocado_aggregate_plot.encoding.x['shorthand'] is not str. avocado_aggregate_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot.encoding.x['shorthand'])).encode("utf-8")+b"ee7b2").hexdigest() == "4b403cd13f6161a6a5233c86d7ccda2b8703a687", "length of avocado_aggregate_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"ee7b2").hexdigest() == "17292a184b887518f4af9d28d2b27c63c919187a", "value of avocado_aggregate_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.x['shorthand']).encode("utf-8")+b"ee7b2").hexdigest() == "17292a184b887518f4af9d28d2b27c63c919187a", "correct string value of avocado_aggregate_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot.encoding.y['shorthand'])).encode("utf-8")+b"ee7b3").hexdigest() == "72c647ee9e5ebf070076bd85e1eb101229cbced4", "type of avocado_aggregate_plot.encoding.y['shorthand'] is not str. avocado_aggregate_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot.encoding.y['shorthand'])).encode("utf-8")+b"ee7b3").hexdigest() == "79d24109ea8b1d5d16c568ead4f9d483d56a016e", "length of avocado_aggregate_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"ee7b3").hexdigest() == "15cc9c1ca04225d143c8de761f67c36c2ce0c715", "value of avocado_aggregate_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.y['shorthand']).encode("utf-8")+b"ee7b3").hexdigest() == "15cc9c1ca04225d143c8de761f67c36c2ce0c715", "correct string value of avocado_aggregate_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot.mark)).encode("utf-8")+b"ee7b4").hexdigest() == "cf2472bab101fc43fdd27cc1bac0663a1066968c", "type of avocado_aggregate_plot.mark is not str. avocado_aggregate_plot.mark should be an str"
assert sha1(str(len(avocado_aggregate_plot.mark)).encode("utf-8")+b"ee7b4").hexdigest() == "f96b7704e1fea8d8c7ba43064c8698781e4d3c6a", "length of avocado_aggregate_plot.mark is not correct"
assert sha1(str(avocado_aggregate_plot.mark.lower()).encode("utf-8")+b"ee7b4").hexdigest() == "ec39c9cb890f82dfc1c48022366a7ea84b9b04da", "value of avocado_aggregate_plot.mark is not correct"
assert sha1(str(avocado_aggregate_plot.mark).encode("utf-8")+b"ee7b4").hexdigest() == "ec39c9cb890f82dfc1c48022366a7ea84b9b04da", "correct string value of avocado_aggregate_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(avocado_aggregate_plot.encoding.x['title'], str))).encode("utf-8")+b"ee7b5").hexdigest() == "0769639de12b8d44c8842f99638c960e6ba61fd1", "type of isinstance(avocado_aggregate_plot.encoding.x['title'], str) is not bool. isinstance(avocado_aggregate_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot.encoding.x['title'], str)).encode("utf-8")+b"ee7b5").hexdigest() == "69063b05a123a148808e399713d28c9b40f6af47", "boolean value of isinstance(avocado_aggregate_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(avocado_aggregate_plot.encoding.y['title'], str))).encode("utf-8")+b"ee7b6").hexdigest() == "074b76890681b3a592d4c558daf07dc282a477c3", "type of isinstance(avocado_aggregate_plot.encoding.y['title'], str) is not bool. isinstance(avocado_aggregate_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot.encoding.y['title'], str)).encode("utf-8")+b"ee7b6").hexdigest() == "a76c595333d58db173f70033fb1d1fe7a4f81ea4", "boolean value of isinstance(avocado_aggregate_plot.encoding.y['title'], str) is not correct"

print('Success!')
Success!

We can now see that the price of avocados does indeed fluctuate throughout the year. We could use this information to recommend that restaurants wanting to maximize profit from menu items that contain avocados offer them only roughly between December and May.

Why might this happen? Perhaps price has something to do with supply? We can also use this data set to get some insight into that question by plotting total avocado volume (y-axis) versus week (x-axis). We will first have to create a column called total_volume whose value is the sum of the small, large and extra-large avocado volumes, which means going back to the original avocado data frame we loaded.

Question 1.7
{points: 1}

Our next step toward plotting total_volume per week against week is to create a new column in the avocado data frame called total_volume, which is equal to the sum of all three volume columns.

Fill in the ___ in the cell below.

In [22]:
Student's answer(Top)
avocado = avocado.assign(
    total_volume=avocado["small_hass_volume"]
    + avocado["large_hass_volume"]
    + avocado["extra_l_hass_volume"]
)

avocado
Out[22]:
Date average_price small_hass_volume large_hass_volume extra_l_hass_volume type yr region wk total_volume
0 2015-12-27 1.33 1036.74 54454.85 48.16 conventional 2015 Albany 52 55539.75
1 2015-12-20 1.35 674.28 44638.81 58.33 conventional 2015 Albany 51 45371.42
2 2015-12-13 0.93 794.70 109149.67 130.50 conventional 2015 Albany 50 110074.87
3 2015-12-06 1.08 1132.00 71976.41 72.58 conventional 2015 Albany 49 73180.99
4 2015-11-29 1.28 941.48 43838.39 75.78 conventional 2015 Albany 48 44855.65
... ... ... ... ... ... ... ... ... ... ...
17906 2018-02-04 1.63 2046.96 1529.20 0.00 organic 2018 WestTexNewMexico 5 3576.16
17907 2018-01-28 1.71 1191.70 3431.50 0.00 organic 2018 WestTexNewMexico 4 4623.20
17908 2018-01-21 1.87 1191.92 2452.79 727.94 organic 2018 WestTexNewMexico 3 4372.65
17909 2018-01-14 1.93 1527.63 2981.04 727.01 organic 2018 WestTexNewMexico 2 5235.68
17910 2018-01-07 1.62 2894.77 2356.13 224.53 organic 2018 WestTexNewMexico 1 5475.43

17911 rows × 10 columns

In [23]:
Grade cell: cell-1b331febb2ce27b5 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert str(type(avocado is None)) == "<class 'bool'>", "type of avocado is None is not bool. avocado is None should be a bool"
assert str(avocado is None) == "False", "boolean value of avocado is None is not correct"

assert str(type(avocado.shape)) == "<class 'tuple'>", "type of avocado.shape is not tuple. avocado.shape should be a tuple"
assert str(len(avocado.shape)) == "2", "length of avocado.shape is not correct"
assert str(sorted(map(str, avocado.shape))) == "['10', '17911']", "values of avocado.shape are not correct"
assert str(avocado.shape) == "(17911, 10)", "order of elements of avocado.shape is not correct"

assert sha1(str(type(sum(avocado.total_volume.dropna()))).encode("utf-8")+b"dfd11").hexdigest() == "afea9d4a14fbfd8a01933be09874afa870f11c88", "type of sum(avocado.total_volume.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(avocado.total_volume.dropna()), 2)).encode("utf-8")+b"dfd11").hexdigest() == "d1bc6d4e7ae96bf5f676a3abeb330631ef28ad57", "value of sum(avocado.total_volume.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!
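
As a side note, the assign pattern used for total_volume can be sketched on a tiny data frame. The column names below mirror the avocado data, but the values are made up for illustration and toy_with_total is a hypothetical name:

```python
import pandas as pd

# Toy data: column names mirror the avocado data, values are invented.
toy = pd.DataFrame({
    "small_hass_volume": [10.0, 20.0],
    "large_hass_volume": [1.0, 2.0],
    "extra_l_hass_volume": [0.5, 0.0],
})

# assign returns a new data frame with the extra column;
# the original data frame is left unchanged.
toy_with_total = toy.assign(
    total_volume=toy["small_hass_volume"]
    + toy["large_hass_volume"]
    + toy["extra_l_hass_volume"]
)
```

Keep in mind that adding columns with + propagates missing values: if any of the three volumes is NaN in a row, the total for that row is NaN as well.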

Question 1.8
{points: 1}

Now, create another reduced/aggregated version of the avocado data frame and name it avocado_aggregate_2. To do this you will want to groupby the wk column and then use mean to calculate the average total volume.

In [24]:
Student's answer(Top)
avocado_aggregate_2 = avocado.groupby("wk").mean(numeric_only=True).reset_index()
avocado_aggregate_2.head()
Out[24]:
wk average_price small_hass_volume large_hass_volume extra_l_hass_volume yr total_volume
0 1 1.286887 191342.390307 218561.677642 14571.720660 2016.5 424475.788608
1 2 1.330519 180736.856368 205410.353231 13433.732241 2016.5 399580.941840
2 3 1.341415 186900.340259 210345.864906 14890.573939 2016.5 412136.779104
3 4 1.315047 183383.858160 206762.362052 14990.596557 2016.5 405136.816769
4 5 1.253608 250123.511250 258856.288561 19160.686274 2016.5 528140.486085
In [25]:
Grade cell: cell-975338ad4661f5af Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert str(type(avocado_aggregate_2 is None)) == "<class 'bool'>", "type of avocado_aggregate_2 is None is not bool. avocado_aggregate_2 is None should be a bool"
assert str(avocado_aggregate_2 is None) == "False", "boolean value of avocado_aggregate_2 is None is not correct"

assert str(type(avocado_aggregate_2.shape)) == "<class 'tuple'>", "type of avocado_aggregate_2.shape is not tuple. avocado_aggregate_2.shape should be a tuple"
assert str(len(avocado_aggregate_2.shape)) == "2", "length of avocado_aggregate_2.shape is not correct"
assert str(sorted(map(str, avocado_aggregate_2.shape))) == "['53', '7']", "values of avocado_aggregate_2.shape are not correct"
assert str(avocado_aggregate_2.shape) == "(53, 7)", "order of elements of avocado_aggregate_2.shape is not correct"

assert sha1(str(type(sum(avocado_aggregate_2.total_volume))).encode("utf-8")+b"b58a8").hexdigest() == "46c97f351b15286235205e1d318e8bacd90e0bfc", "type of sum(avocado_aggregate_2.total_volume) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(avocado_aggregate_2.total_volume), 2)).encode("utf-8")+b"b58a8").hexdigest() == "b11a12de907729e2a42a6dcad3f7c9b0d738e2c2", "value of sum(avocado_aggregate_2.total_volume) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(avocado_aggregate_2.wk))).encode("utf-8")+b"b58a9").hexdigest() == "426a08d984d879a010fbb867392006be851035a4", "type of sum(avocado_aggregate_2.wk) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(avocado_aggregate_2.wk)).encode("utf-8")+b"b58a9").hexdigest() == "7eccd55d058de141e51c5f13fd5780b784dc4ab9", "value of sum(avocado_aggregate_2.wk) is not correct"

print('Success!')
Success!
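
The group-then-average step from Question 1.8 can be sketched on a small, made-up data frame. The names toy and weekly_mean are hypothetical; only the wk and total_volume column names come from the worksheet:

```python
import pandas as pd

# Toy data: two regions observed in each of two weeks.
toy = pd.DataFrame({
    "wk": [1, 1, 2, 2],
    "region": ["A", "B", "A", "B"],
    "total_volume": [100.0, 300.0, 50.0, 150.0],
})

# Group rows by week, average the numeric columns within each group,
# and turn the group key (wk) back into an ordinary column.
weekly_mean = toy.groupby("wk").mean(numeric_only=True).reset_index()
```

Passing numeric_only=True drops the text column region, which has no meaningful mean, so the result has one row per week and only numeric columns.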

Question 1.10
{points: 1}

Now let's take the avocado_aggregate_2 data frame and use it to create a scatter plot where we plot average total_volume (in pounds, lbs) on the y-axis versus wk on the x-axis. Assign your answer to an object called avocado_aggregate_plot_2. Don't forget to create proper English axis labels.

Hint: don't forget to include the units for volume in your axis titles.

In [26]:
Student's answer(Top)
avocado_aggregate_plot_2 = alt.Chart(avocado_aggregate_2).mark_point().encode(
    x=alt.X("wk").title("Week"),
    y=alt.Y("total_volume")
        .title("Total Volume (in pounds, lbs)")
        .scale(zero=False)
)
avocado_aggregate_plot_2
Out[26]:
In [27]:
Grade cell: cell-7a0b7fca31c9c8ec Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(avocado_aggregate_plot_2 is None)).encode("utf-8")+b"dcf85").hexdigest() == "1000b4896ee8916cdb9111997fee68e425bf386f", "type of avocado_aggregate_plot_2 is None is not bool. avocado_aggregate_plot_2 is None should be a bool"
assert sha1(str(avocado_aggregate_plot_2 is None).encode("utf-8")+b"dcf85").hexdigest() == "93779c808f1c940cecf4fe80439e602196c82b7b", "boolean value of avocado_aggregate_plot_2 is None is not correct"

assert sha1(str(type(avocado_aggregate_plot_2.encoding.x['shorthand'])).encode("utf-8")+b"dcf86").hexdigest() == "633af79f019fcdaf45e2aad1964b816f1aa65f0d", "type of avocado_aggregate_plot_2.encoding.x['shorthand'] is not str. avocado_aggregate_plot_2.encoding.x['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot_2.encoding.x['shorthand'])).encode("utf-8")+b"dcf86").hexdigest() == "6fbdedb4b66f72083f15fadde936638486a422ae", "length of avocado_aggregate_plot_2.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.x['shorthand'].lower()).encode("utf-8")+b"dcf86").hexdigest() == "24d6097da639d34676edf292ab524138b45c160f", "value of avocado_aggregate_plot_2.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.x['shorthand']).encode("utf-8")+b"dcf86").hexdigest() == "24d6097da639d34676edf292ab524138b45c160f", "correct string value of avocado_aggregate_plot_2.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot_2.encoding.y['shorthand'])).encode("utf-8")+b"dcf87").hexdigest() == "af9860577ea0ac33271786308aca6bf013a9e290", "type of avocado_aggregate_plot_2.encoding.y['shorthand'] is not str. avocado_aggregate_plot_2.encoding.y['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot_2.encoding.y['shorthand'])).encode("utf-8")+b"dcf87").hexdigest() == "b826131859c813af369cbe2c526db63d156fee65", "length of avocado_aggregate_plot_2.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.y['shorthand'].lower()).encode("utf-8")+b"dcf87").hexdigest() == "307d50093ad42ee827a0cfc192818f17bd77f21b", "value of avocado_aggregate_plot_2.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.y['shorthand']).encode("utf-8")+b"dcf87").hexdigest() == "307d50093ad42ee827a0cfc192818f17bd77f21b", "correct string value of avocado_aggregate_plot_2.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot_2.mark)).encode("utf-8")+b"dcf88").hexdigest() == "c7638f0dcd672dd538dd2a71d094f250ae0a258e", "type of avocado_aggregate_plot_2.mark is not str. avocado_aggregate_plot_2.mark should be an str"
assert sha1(str(len(avocado_aggregate_plot_2.mark)).encode("utf-8")+b"dcf88").hexdigest() == "22695734e929a42a547467c8d8331811c0aaad44", "length of avocado_aggregate_plot_2.mark is not correct"
assert sha1(str(avocado_aggregate_plot_2.mark.lower()).encode("utf-8")+b"dcf88").hexdigest() == "1cffffcd8c4330ff0c9341e6f2216217915192ee", "value of avocado_aggregate_plot_2.mark is not correct"
assert sha1(str(avocado_aggregate_plot_2.mark).encode("utf-8")+b"dcf88").hexdigest() == "1cffffcd8c4330ff0c9341e6f2216217915192ee", "correct string value of avocado_aggregate_plot_2.mark but incorrect case of letters"

assert sha1(str(type(isinstance(avocado_aggregate_plot_2.encoding.x['title'], str))).encode("utf-8")+b"dcf89").hexdigest() == "f455fd857f6d1d22422f5a7e837a998ab1b67867", "type of isinstance(avocado_aggregate_plot_2.encoding.x['title'], str) is not bool. isinstance(avocado_aggregate_plot_2.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot_2.encoding.x['title'], str)).encode("utf-8")+b"dcf89").hexdigest() == "c082b53bf19587e4361ebf74d235e238e4209f90", "boolean value of isinstance(avocado_aggregate_plot_2.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(avocado_aggregate_plot_2.encoding.y['title'], str))).encode("utf-8")+b"dcf8a").hexdigest() == "9572d46de3569055d0fa8f9003c8d8565f22b0b3", "type of isinstance(avocado_aggregate_plot_2.encoding.y['title'], str) is not bool. isinstance(avocado_aggregate_plot_2.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot_2.encoding.y['title'], str)).encode("utf-8")+b"dcf8a").hexdigest() == "3d6ac1ff63091289c9707270c3092aadc65c85cd", "boolean value of isinstance(avocado_aggregate_plot_2.encoding.y['title'], str) is not correct"

print('Success!')
Success!

We can see from the above plot of the average total volume versus the week that more avocados are sold (and perhaps this reflects what is available for sale) roughly between January and May. This period of increased volume corresponds with the lower avocado prices. We can hypothesize (but not conclude, of course) that the lower prices may be due to an increased availability of avocados during this time period.

2. Sea Surface Temperatures in Departure Bay

The next data set that we will be looking at contains environmental data from 1914 to 2018. The data was collected by the DFO (Canada's Department of Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea surface temperature (in degrees Celsius) and salinity (in practical salinity units, PSU) observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied as sampling has been discontinued at some stations, and started or resumed at others.

The program is presently termed the British Columbia Shore Station Oceanographic Program (BCSOP) and has 12 participating stations; most of these are staffed by Fisheries and Oceans Canada. You can look at data from other stations at http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html

Further information from the Government of Canada's website indicates:

Observations are made daily using seawater collected in a bucket lowered into the surface water at or near the daytime high tide. This sampling method was designed long ago by Dr. John P. Tully and has not been changed in the interests of a homogeneous data set. This means, for example, that if an observer starts sampling one day at 6 a.m., and continues to sample at the daytime high tide on the second day the sample will be taken at about 06:50 the next day, 07:40 the day after etc. When the daytime high-tide gets close to 6 p.m. the observer will then begin again to sample early in the morning, and the cycle continues. Since there is a day/night variation in the sea surface temperatures the daily time series will show a signal that varies with the 14-day tidal cycle. This artifact does not affect the monthly sea surface temperature data.
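
The drifting sampling time described above can be worked through with a little date arithmetic. This is only a sketch: the 50-minute daily shift is read off the 06:00, 06:50, 07:40 example in the quoted text, and the start date and the names DAILY_SHIFT and sampling_times are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the daytime high tide arrives about 50 minutes later each day,
# as implied by the 06:00 -> 06:50 -> 07:40 example in the quoted text.
DAILY_SHIFT = timedelta(minutes=50)


def sampling_times(first_sample, days):
    """Return the approximate sampling time on each of `days` consecutive days."""
    one_day = timedelta(days=1)
    return [first_sample + day * (one_day + DAILY_SHIFT) for day in range(days)]


# Hypothetical start: 6 a.m. on an arbitrary date.
start = datetime(2018, 1, 1, 6, 0)
clock_times = [t.strftime("%H:%M") for t in sampling_times(start, 3)]
```

Here clock_times holds the time of day for three consecutive samples, which reproduces the sequence given in the quoted description.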

In this worksheet, we want to see if the sea surface temperature has been changing over time.

Question 2.1 True or False:
{points: 1}

The sampling of surface water occurs at the same time each day.

Assign your answer to an object called answer2_1. Make sure your answer is a boolean. i.e. True or False.

In [28]:
Student's answer(Top)
answer2_1 = False
In [29]:
Grade cell: cell-aef90db69249870d Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer2_1)).encode("utf-8")+b"5b16").hexdigest() == "24efa47aa702b66d41ff5b03a9293652dae80022", "type of answer2_1 is not bool. answer2_1 should be a bool"
assert sha1(str(answer2_1).encode("utf-8")+b"5b16").hexdigest() == "1a8f156ccdceabc61b08768989a4a774bc02ef88", "boolean value of answer2_1 is not correct"

print('Success!')
Success!

Question 2.2 Multiple Choice:
{points: 1}

If high tide occurred at 9am today, what time would the scientist collect data tomorrow?

A. 11:10 am

B. 9:50 am

C. 10:00 pm

D. Trick question... you skip days when collecting data.

Assign your answer to an object called answer2_2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [30]:
Student's answer(Top)
answer2_2 = "B"
In [31]:
Grade cell: cell-957f2b6edf976bfd Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer2_2)).encode("utf-8")+b"7f116").hexdigest() == "acc27b936a1b95c948390aa4fce5c4ae589798e4", "type of answer2_2 is not str. answer2_2 should be an str"
assert sha1(str(len(answer2_2)).encode("utf-8")+b"7f116").hexdigest() == "05434c0edcaff3928219aba8e9c341853a619b65", "length of answer2_2 is not correct"
assert sha1(str(answer2_2.lower()).encode("utf-8")+b"7f116").hexdigest() == "68cac5977752c1d8f68e0c2e94efdc40c54fefa9", "value of answer2_2 is not correct"
assert sha1(str(answer2_2).encode("utf-8")+b"7f116").hexdigest() == "32c6e88a0d9812b4507fe42884d58c3cc1437a23", "correct string value of answer2_2 but incorrect case of letters"

print('Success!')
Success!

Question 2.3
{points: 1}

To begin working with this data, read the file departure_bay_temperature.csv using a relative path. Note, this file (just like the avocado data set) is found within the data directory.

Assign your answer to an object called sea_surface.

Hint: check out the data file in the editor mode to see from which row the actual data begins, and you will need to specify the skiprows argument accordingly in the suitable pandas function.
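
To illustrate what skiprows does, here is a minimal sketch that reads a made-up file from memory instead of from disk. The two metadata lines and the name toy_sea are assumptions for illustration; the real file may have different leading lines.

```python
import io

import pandas as pd

# A made-up file: two metadata lines, then the header row and the data.
raw = io.StringIO(
    "Station: Example Bay\n"
    "Units: degrees Celsius\n"
    "Year,Jan,Feb\n"
    "1914,7.2,\n"
    "1915,5.6,6.6\n"
)

# skiprows=2 tells pandas to ignore the first two lines,
# so the third line is used as the column header.
toy_sea = pd.read_csv(raw, skiprows=2)
```

The empty field after the last comma on the 1914 line is read as a missing value (NaN).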

In [32]:
Student's answer(Top)
sea_surface = pd.read_csv("data/departure_bay_temperature.csv", skiprows=2)
sea_surface
Out[32]:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1914 7.2 NaN NaN NaN NaN NaN NaN NaN 11.1 10.0 7.3 6.3
1 1915 5.6 6.6 7.5 9.0 9.9 12.5 14.7 15.8 14.0 8.2 4.4 4.1
2 1916 1.2 0.1 3.5 6.5 8.0 12.0 13.1 14.0 11.4 7.6 5.4 3.5
3 1917 3.8 2.8 4.4 5.4 8.3 11.0 13.7 12.2 10.0 8.6 7.0 4.9
4 1918 3.7 3.9 4.6 6.0 9.3 11.2 13.1 14.5 13.8 9.1 6.7 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
100 2014 4.4 3.1 3.7 7.7 10.1 12.0 15.9 16.8 11.7 10.6 7.1 7.1
101 2015 6.3 8.0 8.0 8.9 11.0 15.5 13.8 13.4 11.6 11.3 8.1 6.8
102 2016 6.0 7.1 8.4 9.8 13.0 14.2 14.6 14.6 12.6 10.8 8.2 5.5
103 2017 5.6 4.8 7.1 7.9 10.5 12.4 15.3 15.3 13.1 10.2 8.8 6.9
104 2018 6.2 6.0 7.1 8.2 NaN NaN NaN NaN NaN NaN NaN NaN

105 rows × 13 columns

In [33]:
Grade cell: cell-09a9fd7ca9f44ada Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert str(type(sea_surface is None)) == "<class 'bool'>", "type of sea_surface is None is not bool. sea_surface is None should be a bool"
assert str(sea_surface is None) == "False", "boolean value of sea_surface is None is not correct"

assert str(type(sea_surface)) == "<class 'pandas.core.frame.DataFrame'>", "type of type(sea_surface) is not correct"

assert str(type(sea_surface.shape)) == "<class 'tuple'>", "type of sea_surface.shape is not tuple. sea_surface.shape should be a tuple"
assert str(len(sea_surface.shape)) == "2", "length of sea_surface.shape is not correct"
assert str(sorted(map(str, sea_surface.shape))) == "['105', '13']", "values of sea_surface.shape are not correct"
assert str(sea_surface.shape) == "(105, 13)", "order of elements of sea_surface.shape is not correct"

assert str(type(sea_surface.columns.values)) == "<class 'numpy.ndarray'>", "type of sea_surface.columns.values is not correct"
assert str(sea_surface.columns.values) == "['Year' 'Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov'\n 'Dec']", "value of sea_surface.columns.values is not correct"

assert sha1(str(type(sum(sea_surface.Year))).encode("utf-8")+b"d73df").hexdigest() == "c09b364f129656a81460e3737b69d17997222b6d", "type of sum(sea_surface.Year) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(sea_surface.Year)).encode("utf-8")+b"d73df").hexdigest() == "18eabd7428348c0050ed175e666f8d06b24fa681", "value of sum(sea_surface.Year) is not correct"

print('Success!')
Success!

Question 2.3.1
{points: 1}

The data above in Question 2.3 is not tidy. Which of the reasons listed below explain why?

A. There are NaN's in the data set

B. The variable temperature is split across more than one column

C. Values for the variable month are stored as column names

D. A and C

E. B and C

F. All of the above

Assign your answer to an object called answer2_3_1.

In [34]:
Student's answer(Top)
answer2_3_1 = "E"
In [35]:
Grade cell: cell-f449a87635bac905 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer2_3_1)).encode("utf-8")+b"9ecf9").hexdigest() == "dffd9e88e5de4ec6a8048e7dc5bff8d7767ee1a7", "type of answer2_3_1 is not str. answer2_3_1 should be an str"
assert sha1(str(len(answer2_3_1)).encode("utf-8")+b"9ecf9").hexdigest() == "dfc38fbf0fbf380e6ed5fc6cb539750a5f28c910", "length of answer2_3_1 is not correct"
assert sha1(str(answer2_3_1.lower()).encode("utf-8")+b"9ecf9").hexdigest() == "f9b0508a55c3b6e0fa8d746167165be913493694", "value of answer2_3_1 is not correct"
assert sha1(str(answer2_3_1).encode("utf-8")+b"9ecf9").hexdigest() == "1e2db29fe6bab7c8b2b040fd72a847cb097bb590", "correct string value of answer2_3_1 but incorrect case of letters"

print('Success!')
Success!

Question 2.4
{points: 1}

Given that altair expects tidy data, we need to convert our data into that format. To do this we will use the melt function. We would like our data to end up looking like this:

Year Month Temperature
1914 Jan 7.2
1915 Jan 5.6
1916 Jan 1.2
1917 Jan 3.8
1918 Jan 3.7
... ... ...
2014 Dec 7.1
2015 Dec 6.8
2016 Dec 5.5
2017 Dec 6.9
2018 Dec NaN

Fill in the ___ in the cell below.

Assign your answer to an object called tidy_temp.

In [36]:
Student's answer(Top)
tidy_temp = sea_surface.melt(id_vars=['Year'],  var_name='Month', value_name='Temperature')
tidy_temp
Out[36]:
Year Month Temperature
0 1914 Jan 7.2
1 1915 Jan 5.6
2 1916 Jan 1.2
3 1917 Jan 3.8
4 1918 Jan 3.7
... ... ... ...
1255 2014 Dec 7.1
1256 2015 Dec 6.8
1257 2016 Dec 5.5
1258 2017 Dec 6.9
1259 2018 Dec NaN

1260 rows × 3 columns

In [37]:
Grade cell: cell-afb070ca8361d0a7 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert str(type(tidy_temp is None)) == "<class 'bool'>", "type of tidy_temp is None is not bool. tidy_temp is None should be a bool"
assert str(tidy_temp is None) == "False", "boolean value of tidy_temp is None is not correct"

assert str(type(tidy_temp.shape)) == "<class 'tuple'>", "type of tidy_temp.shape is not tuple. tidy_temp.shape should be a tuple"
assert str(len(tidy_temp.shape)) == "2", "length of tidy_temp.shape is not correct"
assert str(sorted(map(str, tidy_temp.shape))) == "['1260', '3']", "values of tidy_temp.shape are not correct"
assert str(tidy_temp.shape) == "(1260, 3)", "order of elements of tidy_temp.shape is not correct"

assert str(type(tidy_temp.columns)) == "<class 'pandas.core.indexes.base.Index'>", "type of tidy_temp.columns is not correct"
assert str(tidy_temp.columns) == "Index(['Year', 'Month', 'Temperature'], dtype='object')", "value of tidy_temp.columns is not correct"

assert sha1(str(type(sum(tidy_temp.Temperature.dropna()))).encode("utf-8")+b"3719b").hexdigest() == "f0fa4ceb7fe8ca0463cae67107438059351abe65", "type of sum(tidy_temp.Temperature.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(tidy_temp.Temperature.dropna()), 2)).encode("utf-8")+b"3719b").hexdigest() == "7f9a5ab6fd78313350c71c17274fe1b912015e34", "value of sum(tidy_temp.Temperature.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!
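
To see how melt reshapes wide data into long (tidy) data, here is a minimal sketch on a two-year, two-month slice. The values are taken from the table shown in Question 2.3; the names wide and long are hypothetical.

```python
import pandas as pd

# Wide format: one column per month.
wide = pd.DataFrame({
    "Year": [1915, 1916],
    "Jan": [5.6, 1.2],
    "Feb": [6.6, 0.1],
})

# id_vars stays as a column; every other column name becomes a value in
# "Month", and the matching cell becomes a value in "Temperature".
long = wide.melt(id_vars=["Year"], var_name="Month", value_name="Temperature")
```

Each of the 2 years is repeated once per month column, so the result has 2 × 2 = 4 rows, just as the full data set goes from 105 rows to 105 × 12 = 1260 rows.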

Question 2.5
{points: 1}

Now that we have our data in a tidy format, we can create our plot that compares the average monthly sea surface temperatures (in degrees Celsius) to the year they were recorded. To make our plots more informative, we should plot each month separately. We can filter the data before we pass it to the alt.Chart function. Let's start out by just plotting the data for the month of November. As usual, use proper English to label your axes :)

Assign your answer to an object called nov_temp_plot.

Hint: don't forget to include the units for temperature in your data visualization.
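
The filtering step on its own can be sketched with a boolean mask. The values below come from the October and November columns of the table in Question 2.3; the names toy, is_nov and nov_only are hypothetical.

```python
import pandas as pd

# Tidy toy data: one row per (Year, Month) observation.
toy = pd.DataFrame({
    "Year": [1915, 1915, 1916, 1916],
    "Month": ["Oct", "Nov", "Oct", "Nov"],
    "Temperature": [8.2, 4.4, 7.6, 5.4],
})

# The comparison yields one True/False value per row ...
is_nov = toy["Month"] == "Nov"

# ... and indexing with that mask keeps only the rows marked True.
nov_only = toy[is_nov]
```

A filtered data frame like nov_only can then be handed to a plotting function in place of the full data.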

In [38]:
Student's answer(Top)
nov_temp_plot = alt.Chart(tidy_temp[tidy_temp["Month"] == "Nov"]).mark_point().encode(
    x=alt.X("Year")
        .scale(zero=False),
    y=alt.Y("Temperature")
        .title("Temperature (Degrees Celsius)")
        .scale(zero=False)
)

nov_temp_plot
Out[38]:
In [39]:
Grade cell: cell-8ddfbb3c8b82e695 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(nov_temp_plot is None)).encode("utf-8")+b"83fc3").hexdigest() == "b8e6dbc24eed9fe83b6e7a333c2a27fb72165e06", "type of nov_temp_plot is None is not bool. nov_temp_plot is None should be a bool"
assert sha1(str(nov_temp_plot is None).encode("utf-8")+b"83fc3").hexdigest() == "f0f3f4be5817cffcf54192b1e98e4ce99af0010d", "boolean value of nov_temp_plot is None is not correct"

assert sha1(str(type(nov_temp_plot.data.Month.unique())).encode("utf-8")+b"83fc4").hexdigest() == "b2574d9f452cd156437d3edbb484748bf2923c25", "type of nov_temp_plot.data.Month.unique() is not correct"
assert sha1(str(nov_temp_plot.data.Month.unique()).encode("utf-8")+b"83fc4").hexdigest() == "7dfdac7d9b33b795e55cfb9047c4786bd05ca85b", "value of nov_temp_plot.data.Month.unique() is not correct"

assert sha1(str(type(nov_temp_plot.encoding.x['shorthand'])).encode("utf-8")+b"83fc5").hexdigest() == "4d0b821fd222dbd81134b79ac24186dd3be06a53", "type of nov_temp_plot.encoding.x['shorthand'] is not str. nov_temp_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(nov_temp_plot.encoding.x['shorthand'])).encode("utf-8")+b"83fc5").hexdigest() == "cf0dd8cf487d3793f3d9971ced8a0c035619efcd", "length of nov_temp_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"83fc5").hexdigest() == "15814f77bdefe2288fb67b908a2bdb77d587c5cf", "value of nov_temp_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.x['shorthand']).encode("utf-8")+b"83fc5").hexdigest() == "fdd7327f19226bfaf5ce750d9810aa7d0c718c65", "correct string value of nov_temp_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(nov_temp_plot.encoding.y['shorthand'])).encode("utf-8")+b"83fc6").hexdigest() == "dae711837620768f1a30f30341bd2ac159e37e88", "type of nov_temp_plot.encoding.y['shorthand'] is not str. nov_temp_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(nov_temp_plot.encoding.y['shorthand'])).encode("utf-8")+b"83fc6").hexdigest() == "b274cdedd4b2894f0c20c0b3befa023c7f88cfcb", "length of nov_temp_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"83fc6").hexdigest() == "dcb7f962e0be58c59cf1309735b3f6b55655d2ba", "value of nov_temp_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.y['shorthand']).encode("utf-8")+b"83fc6").hexdigest() == "1bd4f870cf12dfbe1057bc986ab6d9a851ec3f84", "correct string value of nov_temp_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(nov_temp_plot.mark)).encode("utf-8")+b"83fc7").hexdigest() == "00796992244bf10d5945833331f6aea7ae73c68f", "type of nov_temp_plot.mark is not str. nov_temp_plot.mark should be an str"
assert sha1(str(len(nov_temp_plot.mark)).encode("utf-8")+b"83fc7").hexdigest() == "ee6a22e160b6cadfd5d1d48b4a17e0b30e334edd", "length of nov_temp_plot.mark is not correct"
assert sha1(str(nov_temp_plot.mark.lower()).encode("utf-8")+b"83fc7").hexdigest() == "9a055120f6f2a876007f17585a818c01978a1651", "value of nov_temp_plot.mark is not correct"
assert sha1(str(nov_temp_plot.mark).encode("utf-8")+b"83fc7").hexdigest() == "9a055120f6f2a876007f17585a818c01978a1651", "correct string value of nov_temp_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(nov_temp_plot.encoding.y['title'], str))).encode("utf-8")+b"83fc8").hexdigest() == "4d32bc432f3b073e80b0432faf8dede18a800cbc", "type of isinstance(nov_temp_plot.encoding.y['title'], str) is not bool. isinstance(nov_temp_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(nov_temp_plot.encoding.y['title'], str)).encode("utf-8")+b"83fc8").hexdigest() == "d6b5dc6ad8e7c1e2bb6a634acd1a1ea5abdf8ecc", "boolean value of isinstance(nov_temp_plot.encoding.y['title'], str) is not correct"

print('Success!')
Success!

We can see that there may be a small decrease in colder temperatures in recent years, and/or the temperatures in recent years look less variable compared to years before 1975. What about other months? Let's plot them!

Instead of repeating the code above for the 11 other months, we'll take advantage of an altair function that we haven't met yet: facet. We will learn more about this function next week; this week we will give you the code for it.

Question 2.6
{points: 1}

Fill in the missing code below to plot the average monthly sea surface temperatures against the year they were recorded, for all months.

Assign your answer to an object called all_temp_plot.

Hint: don't forget to include the units for temperature in your data visualization.

In [40]:
Student's answer(Top)
all_temp_plot = alt.Chart(tidy_temp).mark_point().encode(
    x=alt.X("Year")
        .scale(zero=False),
    y=alt.Y("Temperature")
        .title("Temperature (degrees Celsius)")
        .scale(zero=False)
).facet(
    'Month',
    columns=4,
)


all_temp_plot
Out[40]:
In [41]:
Grade cell: cell-4529c8b1eb657878 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(all_temp_plot is None)).encode("utf-8")+b"eac9d").hexdigest() == "6f46a4b9b2d29e91f1bf2236695627551197da41", "type of all_temp_plot is None is not bool. all_temp_plot is None should be a bool"
assert sha1(str(all_temp_plot is None).encode("utf-8")+b"eac9d").hexdigest() == "36468d296ef0f68e449c43447f89de73bffa8db3", "boolean value of all_temp_plot is None is not correct"

assert sha1(str(type("Month" in all_temp_plot.data.columns)).encode("utf-8")+b"eac9e").hexdigest() == "aca9062cc6d96821bf66d4d366df2cc43817b41d", "type of \"Month\" in all_temp_plot.data.columns is not bool. \"Month\" in all_temp_plot.data.columns should be a bool"
assert sha1(str("Month" in all_temp_plot.data.columns).encode("utf-8")+b"eac9e").hexdigest() == "6bc1555430d16c4b9517f7eaadc8226e6a992972", "boolean value of \"Month\" in all_temp_plot.data.columns is not correct"

assert sha1(str(type(all_temp_plot.facet)).encode("utf-8")+b"eac9f").hexdigest() == "a5874e0f9f26e63e6ba7b33fa9d77d909bcfb61e", "type of all_temp_plot.facet is not correct"
assert sha1(str(all_temp_plot.facet).encode("utf-8")+b"eac9f").hexdigest() == "30acd53f5df15174b2473769bd853e876ced2c5a", "value of all_temp_plot.facet is not correct"

print('Success!')
Success!

We can see above that some months show a small but general increase in temperatures, whereas others don't. Some months also show a change in variability while others do not. From this it is clear that if we are trying to understand temperature changes over time, we had best keep data from different months separate. Also note that the months are sorted in alphabetical order; it would have been better to sort them by where each month falls in the year. We will learn how to do this in an upcoming chapter!
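
To make the difference between alphabetical and calendar order concrete, here is a minimal pandas sketch. It uses an ordered categorical, which is one possible approach and not necessarily the one the upcoming chapter teaches; MONTH_ORDER and the other names are hypothetical.

```python
import pandas as pd

# The order we actually want: position within the year.
MONTH_ORDER = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

months = pd.Series(["Apr", "Aug", "Dec", "Feb", "Jan"])

# Plain string sorting compares letters, so "Apr" comes before "Jan".
alphabetical = sorted(months)

# An ordered categorical sorts by the position of each value in MONTH_ORDER.
month_dtype = pd.CategoricalDtype(categories=MONTH_ORDER, ordered=True)
calendar_sorted = months.astype(month_dtype).sort_values().tolist()
```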

3. Pollution in Madrid

We're working with a data set from Kaggle once again! This data was collected under instructions from Madrid's City Council and is publicly available on their website. In recent years, high levels of pollution during certain dry periods have forced the authorities to take measures against the use of cars and have served as grounds for proposing certain regulations. This data includes daily and hourly measurements of air quality from 2001 to 2008. Pollutants are categorized based on their chemical properties.

There are a number of stations set up around Madrid, and each station's data frame contains all particle measurements that the station registered from 01/2001 - 04/2008. Not every station has the same equipment, so each station can measure only a certain subset of particles. The complete list of possible measurements and their explanations is given by the website:

  • SO_2: sulphur dioxide level measured in μg/m³. High levels can produce irritation in the skin and membranes, and worsen asthma or heart diseases in sensitive groups.
  • CO: carbon monoxide level measured in mg/m³. Carbon monoxide poisoning involves headaches, dizziness and confusion in short exposures and can result in loss of consciousness, arrhythmias, seizures or even death.
  • NO_2: nitrogen dioxide level measured in μg/m³. Long-term exposure is a cause of chronic lung diseases, and the gas is harmful to vegetation.
  • PM10: particles smaller than 10 μm. Even though they cannot penetrate the alveolus, they can still penetrate through the lungs and affect other organs. Long term exposure can result in lung cancer and cardiovascular complications.
  • NOx: nitrogen oxides level measured in μg/m³. They affect the human respiratory system, worsening asthma or other diseases, and are responsible for the yellowish-brown color of photochemical smog.
  • O_3: ozone level measured in μg/m³. High levels can produce asthma, bronchitis or other chronic pulmonary diseases in sensitive groups or outdoor workers.
  • TOL: toluene (methylbenzene) level measured in μg/m³. Long-term exposure to this substance (present in tobacco smoke as well) can result in kidney complications or permanent brain damage.
  • BEN: benzene level measured in μg/m³. Benzene is an eye and skin irritant, and long exposures may result in several types of cancer, leukaemia and anaemias. Benzene is considered a Group 1 carcinogen to humans.
  • EBE: ethylbenzene level measured in μg/m³. Long term exposure can cause hearing or kidney problems and the IARC has concluded that long-term exposure can produce cancer.
  • MXY: m-xylene level measured in μg/m³. Xylenes can affect not only air but also water and soil, and a long exposure to high levels of xylenes can result in diseases affecting the liver, kidney and nervous system.
  • PXY: p-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
  • OXY: o-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
  • TCH: total hydrocarbons level measured in mg/m³. This group of substances can be responsible for different blood, immune system, liver, spleen, kidney or lung diseases.
  • NMHC: non-methane hydrocarbons (volatile organic compounds) level measured in mg/m³. Long exposure to some of these substances can result in damage to the liver, kidney, and central nervous system. Some of them are suspected to cause cancer in humans.

The goal of this assignment is to see whether pollutants are decreasing (that is, whether air quality is improving) and to compare which pollutant has decreased the most over the span of 5 years (2001 - 2006).

  1. First do a plot of one of the pollutants (EBE).
  2. Next, group it by month and year; calculate the maximum value and plot it (to see the trend through time).
  3. Now we will look at which pollutant decreased the most. First we will look at pollution in 2001 (get the maximum value for each of the pollutants). And then do the same for 2006.
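
Step 2 of the plan above (group by month and year, then take the maximum) can be sketched on a few made-up rows. The column names year, mnth and EBE follow the Madrid data; the values and the names toy and monthly_max are for illustration only.

```python
import pandas as pd

# Toy hourly-style readings: several rows can share the same year and month.
toy = pd.DataFrame({
    "year": [2001, 2001, 2001, 2006],
    "mnth": ["August", "August", "September", "April"],
    "EBE": [1.49, 0.88, 0.70, 0.45],
})

# Group by both keys, keep the largest EBE reading in each group,
# and turn the group keys back into ordinary columns.
monthly_max = toy.groupby(["year", "mnth"])["EBE"].max().reset_index()
```

The result has one row per (year, month) pair; the two August 2001 readings collapse into a single row holding the larger value.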

Question 3.1 Multiple Choice:
{points: 1}

What big picture question are we trying to answer?

A. Did EBE decrease in Madrid between 2001 and 2006?

B. Of all the pollutants, which decreased the most between 2001 and 2006?

C. Of all the pollutants, which decreased the least between 2001 and 2006?

D. Did EBE increase in Madrid between 2001 and 2006?

Assign your answer to an object called answer3_1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [42]:
Student's answer(Top)
answer3_1 = "B"
In [43]:
Grade cell: cell-d67db6d2cd3971aa Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(all_temp_plot)).encode("utf-8")+b"67e07").hexdigest() == "2583e54a420f30fac78315f462c71dadcd40cb24", "type of all_temp_plot is not correct"
assert sha1(str(all_temp_plot).encode("utf-8")+b"67e07").hexdigest() == "def0c28d010741ed214fad9ac2606daff71ddf34", "value of all_temp_plot is not correct"

print('Success!')
Success!

Question 3.2
{points: 1}

To begin working with this data, read the file madrid_pollution.csv. Note that this file (just like the avocado and sea surface data sets) is found in the data directory.

Assign your answer to an object called madrid.

Hint: check out the data file in the editor mode to see which delimiter is used, and then select the proper pandas function.
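
The delimiter can also be checked from within Python. The sketch below is only an illustration: it uses a small made-up, in-memory sample (an assumption, not the real data file) and lets the standard library's csv.Sniffer guess the delimiter before handing it to pandas.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for the first lines of a data file.
sample = (
    "date\tBEN\tCO\n"
    "2001-08-01T01:00:00Z\t1.50\t0.34\n"
    "2001-08-01T02:00:00Z\t0.87\t0.06\n"
)

# Ask the standard library to guess the delimiter from a few candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print(repr(dialect.delimiter))

# Pass the detected delimiter on to pandas.
df = pd.read_csv(io.StringIO(sample), sep=dialect.delimiter)
print(df.shape)
```

For a real file, the same idea would apply to the first few lines read from disk; looking at the file in an editor, as the hint suggests, remains the simplest check.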

In [44]:
Student's answer(Top)
madrid = pd.read_csv("data/madrid_pollution.csv", sep="\t")
madrid
Out[44]:
date BEN CO EBE MXY NMHC NO_2 NOx OXY O_3 PM10 PXY SO_2 TCH TOL year mnth
0 2001-08-01T01:00:00Z 1.50 0.34 1.49 4.10 0.07 56.250000 75.169998 2.11 42.160000 100.599998 1.73 8.11 1.24 10.82 2001 August
1 2001-08-01T02:00:00Z 0.87 0.06 0.88 2.41 0.01 29.709999 31.440001 1.20 56.520000 56.290001 1.02 6.90 1.17 6.49 2001 August
2 2001-08-01T03:00:00Z 0.66 0.02 0.61 1.60 0.01 22.750000 22.459999 0.80 64.059998 36.650002 0.69 6.59 1.17 6.37 2001 August
3 2001-08-01T04:00:00Z 0.47 0.04 0.41 1.00 0.02 31.590000 34.770000 0.47 60.820000 25.820000 0.44 6.45 1.21 4.91 2001 August
4 2001-08-01T05:00:00Z 0.60 0.04 0.67 1.68 0.01 30.940001 32.509998 0.74 65.559998 31.100000 0.72 6.37 1.22 5.28 2001 August
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51859 2006-04-30T20:00:00Z 0.57 0.38 0.45 1.30 0.21 53.299999 69.879997 0.72 67.050003 16.270000 0.59 6.40 1.31 1.96 2006 April
51860 2006-04-30T21:00:00Z 0.82 0.35 0.66 1.66 0.20 45.810001 49.459999 0.88 64.419998 43.250000 0.72 5.96 1.31 3.05 2006 April
51861 2006-04-30T22:00:00Z 0.88 0.52 0.66 1.71 0.24 87.019997 93.669998 0.84 21.930000 66.769997 0.74 6.19 1.35 2.96 2006 April
51862 2006-04-30T23:00:00Z 1.24 0.57 1.03 2.58 0.24 91.360001 100.400002 1.29 13.170000 56.610001 1.14 6.34 1.36 4.97 2006 April
51863 2006-05-01T00:00:00Z 1.26 0.60 1.09 2.95 0.27 98.050003 129.300003 1.52 9.340000 45.610001 1.24 6.98 1.41 5.95 2006 May

51864 rows × 17 columns

In [45]:
Grade cell: cell-902507dc58ec5428 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert str(type(madrid is None)) == "<class 'bool'>", "type of madrid is None is not bool. madrid is None should be a bool"
assert str(madrid is None) == "False", "boolean value of madrid is None is not correct"

assert str(type(madrid)) == "<class 'pandas.core.frame.DataFrame'>", "type of type(madrid) is not correct"

assert str(type(madrid.shape)) == "<class 'tuple'>", "type of madrid.shape is not tuple. madrid.shape should be a tuple"
assert str(len(madrid.shape)) == "2", "length of madrid.shape is not correct"
assert str(sorted(map(str, madrid.shape))) == "['17', '51864']", "values of madrid.shape are not correct"
assert str(madrid.shape) == "(51864, 17)", "order of elements of madrid.shape is not correct"

assert str(type(madrid.columns.values)) == "<class 'numpy.ndarray'>", "type of madrid.columns.values is not correct"
assert str(madrid.columns.values) == "['date' 'BEN' 'CO' 'EBE' 'MXY' 'NMHC' 'NO_2' 'NOx' 'OXY' 'O_3' 'PM10'\n 'PXY' 'SO_2' 'TCH' 'TOL' 'year' 'mnth']", "value of madrid.columns.values is not correct"

assert sha1(str(type(sum(madrid.BEN.dropna()))).encode("utf-8")+b"726cd").hexdigest() == "03fe490f19ff671ad5e9acbfa415d20d68131b3f", "type of sum(madrid.BEN.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(madrid.BEN.dropna()), 2)).encode("utf-8")+b"726cd").hexdigest() == "6a2b83454667313b322446dbb244e523de8c97bd", "value of sum(madrid.BEN.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!

Question 3.3
{points: 1}

Now that the data is loaded in Python, create a scatter plot that compares ethylbenzene (EBE) values against the date they were recorded. This graph will showcase the concentration of ethylbenzene in Madrid over time. As usual, label your axes:

  • x = Date
  • y = Ethylbenzene (μg/m³)

Assign your answer to an object called EBE_pollution.

In [46]:
Student's answer(Top)
EBE_pollution = alt.Chart(madrid).mark_point().encode(
    x=alt.X("date:T").title("Date"),
    y=alt.Y("EBE").title("Ethylbenzene (μg/m³)")
).properties(width=800)

EBE_pollution

# Are levels increasing or decreasing?
Out[46]:
In [47]:
Grade cell: cell-4de75a47d9cc2dca Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert sha1(str(type(EBE_pollution is None)).encode("utf-8")+b"a9d2a").hexdigest() == "a0e349c1bbcc63261717f36d3df5e6a8166798aa", "type of EBE_pollution is None is not bool. EBE_pollution is None should be a bool"
assert sha1(str(EBE_pollution is None).encode("utf-8")+b"a9d2a").hexdigest() == "906faf6cc310ce3bc9e58710807fab5e019e10bb", "boolean value of EBE_pollution is None is not correct"

assert sha1(str(type(EBE_pollution.encoding.x['shorthand'])).encode("utf-8")+b"a9d2b").hexdigest() == "4a2c15aeaa3561c1b160db2841acc4916f2ea0a9", "type of EBE_pollution.encoding.x['shorthand'] is not str. EBE_pollution.encoding.x['shorthand'] should be an str"
assert sha1(str(len(EBE_pollution.encoding.x['shorthand'])).encode("utf-8")+b"a9d2b").hexdigest() == "9a966438a9335f07c7ab99680f03740a551f3349", "length of EBE_pollution.encoding.x['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.x['shorthand'].lower()).encode("utf-8")+b"a9d2b").hexdigest() == "d52ca32df8a10564130e347fbbcd73758506209e", "value of EBE_pollution.encoding.x['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.x['shorthand']).encode("utf-8")+b"a9d2b").hexdigest() == "8957122af4fcb290b5582173aa0387f6b1bdfe23", "correct string value of EBE_pollution.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(EBE_pollution.encoding.y['shorthand'])).encode("utf-8")+b"a9d2c").hexdigest() == "9bb2301992ce7d3710551a31ecec956325569604", "type of EBE_pollution.encoding.y['shorthand'] is not str. EBE_pollution.encoding.y['shorthand'] should be an str"
assert sha1(str(len(EBE_pollution.encoding.y['shorthand'])).encode("utf-8")+b"a9d2c").hexdigest() == "bf0ecc21d8e0ec14a1d82feef48945b87dc0c638", "length of EBE_pollution.encoding.y['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.y['shorthand'].lower()).encode("utf-8")+b"a9d2c").hexdigest() == "2eba0f8321b01edd2441278de56296ce411f9d34", "value of EBE_pollution.encoding.y['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.y['shorthand']).encode("utf-8")+b"a9d2c").hexdigest() == "3ddcf710a5e27434625af9b94829b807d4eab866", "correct string value of EBE_pollution.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(EBE_pollution.mark)).encode("utf-8")+b"a9d2d").hexdigest() == "17fb0e61b747a5e9e4129ab606c8eb508f52039c", "type of EBE_pollution.mark is not str. EBE_pollution.mark should be an str"
assert sha1(str(len(EBE_pollution.mark)).encode("utf-8")+b"a9d2d").hexdigest() == "37b979bb7f2bd66a5efd595eb09e69e6213523cd", "length of EBE_pollution.mark is not correct"
assert sha1(str(EBE_pollution.mark.lower()).encode("utf-8")+b"a9d2d").hexdigest() == "05c136c31ceea21cd0dec71be39b02b84d166701", "value of EBE_pollution.mark is not correct"
assert sha1(str(EBE_pollution.mark).encode("utf-8")+b"a9d2d").hexdigest() == "05c136c31ceea21cd0dec71be39b02b84d166701", "correct string value of EBE_pollution.mark but incorrect case of letters"

assert sha1(str(type(isinstance(EBE_pollution.encoding.x['title'], str))).encode("utf-8")+b"a9d2e").hexdigest() == "5f7e5633cb58bec8771aaa86245dbac724a635e6", "type of isinstance(EBE_pollution.encoding.x['title'], str) is not bool. isinstance(EBE_pollution.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(EBE_pollution.encoding.x['title'], str)).encode("utf-8")+b"a9d2e").hexdigest() == "a1bca98d8deee1c84e78240b5a553cb250908b24", "boolean value of isinstance(EBE_pollution.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(EBE_pollution.encoding.y['title'], str))).encode("utf-8")+b"a9d2f").hexdigest() == "17ab4f5489e8d7160934f82de3ec69fbe0b1b8ed", "type of isinstance(EBE_pollution.encoding.y['title'], str) is not bool. isinstance(EBE_pollution.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(EBE_pollution.encoding.y['title'], str)).encode("utf-8")+b"a9d2f").hexdigest() == "df9aaae0c9b82791110f438b5c57122783126078", "boolean value of isinstance(EBE_pollution.encoding.y['title'], str) is not correct"

print('Success!')
Success!

We can see from this plot that over time, there are fewer and fewer high (> 25 μg/m³) EBE values.

Question 3.4
{points: 1}

The question above asks you to write out code that allows visualization of all EBE recordings - which are taken every single hour of every day. Consequently the graph consists of many points and appears so densely plotted that it is difficult to interpret. In this question, we are going to clean up the graph and focus on max EBE readings from each month. To further investigate if this trend is changing over time, we will use groupby and max to create a new data set.

Fill in the ___ in the cell below.

Assign your answer to an object called madrid_pollution.
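
As a rough illustration of the groupby-then-max pattern, here is a minimal sketch on a made-up data frame (the values are hypothetical and are not taken from the Madrid data). Passing numeric_only=True restricts the result to numeric columns.

```python
import pandas as pd

# Hypothetical hourly readings for two months (made-up values).
readings = pd.DataFrame({
    "year": [2001, 2001, 2001, 2001],
    "mnth": ["January", "January", "February", "February"],
    "EBE": [1.0, 3.5, 2.0, 0.5],
})

# One row per (year, month) group, holding the largest value of each numeric column.
monthly_max = (
    readings.groupby(["year", "mnth"])
    .max(numeric_only=True)
    .reset_index()  # turn the group keys back into ordinary columns
)
print(monthly_max)
```

Without reset_index, year and mnth would stay in the row index rather than being available as columns for plotting.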

In [48]:
Student's answer(Top)
# numeric_only=True keeps only the numeric columns (the date strings are dropped)
madrid_pollution = madrid.groupby(["year", "mnth"]).max(numeric_only=True).reset_index()

madrid_pollution
Out[48]:
year mnth BEN CO EBE MXY NMHC NO_2 NOx OXY O_3 PM10 PXY SO_2 TCH TOL
0 2001 April 11.77 3.81 12.480000 31.440001 0.71 186.199997 657.400024 13.650000 106.099998 163.600006 11.770000 49.320000 3.26 59.549999
1 2001 August 6.08 2.67 8.390000 22.879999 2.42 145.500000 455.600006 9.390000 125.199997 160.000000 8.130000 22.940001 3.99 57.570000
2 2001 December 26.49 7.60 77.260002 76.269997 1.34 271.299988 1416.000000 89.510002 47.599998 266.299988 103.000000 137.100006 4.77 155.000000
3 2001 February 19.41 7.10 18.860001 47.709999 1.17 197.000000 1053.000000 19.240000 75.629997 204.800003 16.110001 105.000000 3.21 97.169998
4 2001 January 14.79 7.89 11.720000 28.450001 1.33 170.100006 1100.000000 12.440000 64.379997 231.899994 10.390000 60.439999 3.84 64.660004
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67 2006 March 6.27 2.32 6.710000 17.850000 0.65 195.100006 782.099976 7.710000 63.980000 186.899994 5.760000 36.090000 2.34 55.240002
68 2006 May 4.45 1.26 4.460000 11.610000 0.51 142.199997 516.900024 5.260000 88.300003 183.699997 4.070000 19.209999 2.84 63.560001
69 2006 November 8.09 2.56 12.380000 26.799999 NaN 172.699997 657.000000 11.400000 51.930000 166.399994 8.910000 21.639999 NaN 51.970001
70 2006 October 8.31 2.85 19.750000 32.990002 NaN 150.199997 658.099976 31.389999 69.089996 219.699997 26.950001 20.650000 NaN 57.580002
71 2006 September 4.14 1.62 12.560000 31.719999 NaN 168.899994 451.000000 7.080000 106.800003 268.600006 5.590000 16.200001 NaN 42.200001

72 rows × 16 columns

In [49]:
Grade cell: cell-d04ca4acf0f5f6bc Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(madrid_pollution is None)).encode("utf-8")+b"cf51").hexdigest() == "876977053ebb740f9d5c7027d2e8764948b5fcb7", "type of madrid_pollution is None is not bool. madrid_pollution is None should be a bool"
assert sha1(str(madrid_pollution is None).encode("utf-8")+b"cf51").hexdigest() == "c68ccdf9a8e99656550fea948cafacf5bf1fe238", "boolean value of madrid_pollution is None is not correct"

assert sha1(str(type(madrid_pollution.shape)).encode("utf-8")+b"cf52").hexdigest() == "b3f63db75adce4cb1020a21a8c62c4e9beedab9b", "type of madrid_pollution.shape is not tuple. madrid_pollution.shape should be a tuple"
assert sha1(str(len(madrid_pollution.shape)).encode("utf-8")+b"cf52").hexdigest() == "59d56cdde93a061d7660d2ab3df9da95cea20bf7", "length of madrid_pollution.shape is not correct"
assert sha1(str(sorted(map(str, madrid_pollution.shape))).encode("utf-8")+b"cf52").hexdigest() == "bada56d28fa0fb6b371e211b33d42118b750569e", "values of madrid_pollution.shape are not correct"
assert sha1(str(madrid_pollution.shape).encode("utf-8")+b"cf52").hexdigest() == "220fad5306cd97dca2f991f6c909b6fd2c7c66d0", "order of elements of madrid_pollution.shape is not correct"

assert sha1(str(type(sum(madrid_pollution.year))).encode("utf-8")+b"cf53").hexdigest() == "965d57172627d2359d0c679caf5f640ce55df965", "type of sum(madrid_pollution.year) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(madrid_pollution.year)).encode("utf-8")+b"cf53").hexdigest() == "1384199bad5be9e7dccf25ace61988319f2a861e", "value of sum(madrid_pollution.year) is not correct"

print('Success!')
Success!

Question 3.5
{points: 1}

Plot the new maximum EBE values versus the month they were recorded, split into side-by-side plots for each year. Again, we will use faceting (more on this next week) to plot each year side-by-side.

Assign your answer to an object called madrid_plot. Remember to label your axes.
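
Month names stored as text sort alphabetically (April, August, December, ...), not in calendar order. This is not required for the question, but the sketch below shows one possible way to impose calendar order in pandas, using made-up values and an ordered categorical.

```python
import pandas as pd

# Hypothetical monthly maxima (made-up values), listed in alphabetical month order.
monthly = pd.DataFrame({
    "mnth": ["April", "August", "February", "January"],
    "EBE": [12.5, 8.4, 18.9, 11.7],
})

# Calendar order of the month names.
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# An ordered categorical makes sorting follow the calendar rather than the alphabet.
monthly["mnth"] = pd.Categorical(monthly["mnth"], categories=month_order, ordered=True)
monthly = monthly.sort_values("mnth").reset_index(drop=True)
print(monthly)
```

In Altair, a list such as month_order can likewise be given as the sort option of the x channel, for example alt.X("mnth", sort=month_order), to order the axis by calendar month.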

In [50]:
Student's answer(Top)
madrid_plot = alt.Chart(madrid_pollution).mark_point().encode(
    x=alt.X("mnth").title("Month"),
    y=alt.Y("EBE").title("Ethylbenzene (μg/m³)")
).facet("year:N")
madrid_plot
Out[50]:
In [51]:
Grade cell: cell-b12cea24ac607772 Score: 1.0 / 1.0 (Top)
from hashlib import sha1
assert str(type(madrid_plot is None)) == "<class 'bool'>", "type of madrid_plot is None is not bool. madrid_plot is None should be a bool"
assert str(madrid_plot is None) == "False", "boolean value of madrid_plot is None is not correct"

assert str(type(madrid_plot.facet)) == "<class 'altair.vegalite.v5.schema.channels.Facet'>", "type of madrid_plot.facet is not correct"
assert str(madrid_plot.facet) == "Facet({\n  shorthand: 'year:N'\n})", "value of madrid_plot.facet is not correct"

print('Success!')
Success!

Question 3.6
{points: 1}

Now we want to see which of the pollutants has decreased the most. Therefore, we must repeat the same thing that we did in the questions above but for every pollutant (using the original data set)!

First we will look at Madrid pollution in 2001 (filter for this year). Next we have to drop the columns that should be excluded (such as the date). Lastly, use the max function to create max values for all columns.

Note: The max function would return a pandas series. But since we would need a dataframe for later exercises, we need to convert the series to a dataframe by using pd.DataFrame. Applying transpose to the dataframe turns each row into a column, which is also helpful for later exercises.
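
The note above can be illustrated with a small made-up data frame (hypothetical values). The sketch shows the shape at each step: a Series from max, a one-column data frame from pd.DataFrame, and a one-row data frame after transpose.

```python
import pandas as pd

# Hypothetical readings (made-up values) with a non-numeric column to drop.
toy = pd.DataFrame({
    "date": ["2001-01-01", "2001-01-02", "2001-01-03"],
    "BEN": [1.0, 4.0, 2.0],
    "CO": [0.3, 0.1, 0.9],
})

col_max = toy.drop(columns=["date"]).max()  # Series indexed by column name
as_column = pd.DataFrame(col_max)           # 2 rows x 1 column
as_row = as_column.transpose()              # 1 row x 2 columns

print(type(col_max).__name__, as_column.shape, as_row.shape)
```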

Fill in the ___ in the cell below.

Assign your answer to an object called pollution_2001.

In [52]:
Student's answer(Top)
pollution_2001 = pd.DataFrame(
    madrid
    [madrid["year"]==2001]
    .drop(columns=["date", "year", "mnth"])
    .max()
).transpose()

pollution_2001
Out[52]:
BEN CO EBE MXY NMHC NO_2 NOx OXY O_3 PM10 PXY SO_2 TCH TOL
0 49.939999 10.39 77.260002 93.120003 2.42 271.299988 1416.0 103.0 173.100006 266.299988 103.0 137.100006 4.77 242.899994
In [53]:
Grade cell: cell-23ecb8f3102f3435 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(pollution_2001 is None)).encode("utf-8")+b"1db17").hexdigest() == "f0c03694d844f9b9103e6cd384a564d6eaf7826d", "type of pollution_2001 is None is not bool. pollution_2001 is None should be a bool"
assert sha1(str(pollution_2001 is None).encode("utf-8")+b"1db17").hexdigest() == "d4bc860655e2f4464c4e5a7520e73af46948c080", "boolean value of pollution_2001 is None is not correct"

assert sha1(str(type(pollution_2001.shape)).encode("utf-8")+b"1db18").hexdigest() == "3738584509e958c63e033fb0f98969f52a65dfe1", "type of pollution_2001.shape is not tuple. pollution_2001.shape should be a tuple"
assert sha1(str(len(pollution_2001.shape)).encode("utf-8")+b"1db18").hexdigest() == "a1e0ba5f54257acb5f4a0ab4b6366aa225b15f0d", "length of pollution_2001.shape is not correct"
assert sha1(str(sorted(map(str, pollution_2001.shape))).encode("utf-8")+b"1db18").hexdigest() == "0f94fcfbdc46ee9af28088d1aedb9a73387f4836", "values of pollution_2001.shape are not correct"
assert sha1(str(pollution_2001.shape).encode("utf-8")+b"1db18").hexdigest() == "6a58605bee7ba5cc55d49280c8003774663584d3", "order of elements of pollution_2001.shape is not correct"

assert sha1(str(type(pollution_2001.MXY.values)).encode("utf-8")+b"1db19").hexdigest() == "552c67ca0f7db3e2cdc92ee631817d5713c513d8", "type of pollution_2001.MXY.values is not correct"
assert sha1(str(pollution_2001.MXY.values).encode("utf-8")+b"1db19").hexdigest() == "aed5d19a41bc938f1e0e31de3d59c0b7e8d607bd", "value of pollution_2001.MXY.values is not correct"

assert sha1(str(type(pollution_2001.values.sum())).encode("utf-8")+b"1db1a").hexdigest() == "b93f13117145ff14e2cae22de6f58acf4e63e55b", "type of pollution_2001.values.sum() is not correct"
assert sha1(str(pollution_2001.values.sum()).encode("utf-8")+b"1db1a").hexdigest() == "c903c103becc7cb00f545f1ccc48b8dc2646b479", "value of pollution_2001.values.sum() is not correct"

print('Success!')
Success!

Question 3.7
{points: 1}

Now repeat what you did for Question 3.6, but filter for 2006 instead.

Assign your answer to an object called pollution_2006.

In [54]:
Student's answer(Top)
pollution_2006 = pd.DataFrame(
    madrid
    [madrid["year"]==2006]
    .drop(columns=["date", "year", "mnth"])
    .max()
).transpose()
pollution_2006
Out[54]:
BEN CO EBE MXY NMHC NO_2 NOx OXY O_3 PM10 PXY SO_2 TCH TOL
0 16.9 3.48 19.99 54.869999 0.97 287.100006 1274.0 31.389999 132.0 268.600006 26.950001 66.220001 2.84 64.839996
In [55]:
Grade cell: cell-df023e23794302c6 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert str(type(pollution_2006 is None)) == "<class 'bool'>", "type of pollution_2006 is None is not bool. pollution_2006 is None should be a bool"
assert str(pollution_2006 is None) == "False", "boolean value of pollution_2006 is None is not correct"

assert str(type(pollution_2006.shape)) == "<class 'tuple'>", "type of pollution_2006.shape is not tuple. pollution_2006.shape should be a tuple"
assert str(len(pollution_2006.shape)) == "2", "length of pollution_2006.shape is not correct"
assert str(sorted(map(str, pollution_2006.shape))) == "['1', '14']", "values of pollution_2006.shape are not correct"
assert str(pollution_2006.shape) == "(1, 14)", "order of elements of pollution_2006.shape is not correct"

assert sha1(str(type(pollution_2006.MXY.values)).encode("utf-8")+b"68551").hexdigest() == "6c28ec0ed95f204c10dfbe48fbc6faa10546c3e8", "type of pollution_2006.MXY.values is not correct"
assert sha1(str(pollution_2006.MXY.values).encode("utf-8")+b"68551").hexdigest() == "2cf333853441c598a117f86dd6a4f3a80ed758ba", "value of pollution_2006.MXY.values is not correct"

assert sha1(str(type(pollution_2006.values.sum())).encode("utf-8")+b"68552").hexdigest() == "0aece6d927697ffd9f810cfb9db22d03dde0f6af", "type of pollution_2006.values.sum() is not correct"
assert sha1(str(pollution_2006.values.sum()).encode("utf-8")+b"68552").hexdigest() == "791b2443fefbf621fc04a55c4bd1a9730bef0982", "value of pollution_2006.values.sum() is not correct"

print('Success!')
Success!

Question 3.8
{points: 1}

Which pollutant decreased by the greatest magnitude between 2001 and 2006? Given that the two objects you just created, pollution_2001 and pollution_2006, are data frames with the same columns, you should be able to subtract one from the other to find which pollutant decreased by the greatest magnitude between the two years.

Assign your answer to an object called answer3_8. Make sure to write the answer exactly as it is given in the data set. Example:

answer3_8 = "BEN"
In [56]:
Student's answer(Top)
answer3_8 = "TOL"
In [57]:
Grade cell: cell-5ed11e7cbe1ac843 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(answer3_8)).encode("utf-8")+b"20669").hexdigest() == "a590c6bd065e3e3a6b1c99e3f1161e34c6782a27", "type of answer3_8 is not str. answer3_8 should be an str"
assert sha1(str(len(answer3_8)).encode("utf-8")+b"20669").hexdigest() == "92297e7890dbc44acf7788ecaeb2c2637ed8450b", "length of answer3_8 is not correct"
assert sha1(str(answer3_8.lower()).encode("utf-8")+b"20669").hexdigest() == "932355298aa7aea4f271da291849801e2dabdcec", "value of answer3_8 is not correct"
assert sha1(str(answer3_8).encode("utf-8")+b"20669").hexdigest() == "b54f90203aaf322b9067fe446e341b7fa144c45c", "correct string value of answer3_8 but incorrect case of letters"

print('Success!')
Success!

Question 3.9
{points: 1}

Given that there were only 14 columns in the data frame above, you could use your eyes to pick out which pollutant decreased by the greatest magnitude between 2001 and 2006. But what would you do if you had 100 columns? Or 1000 columns? It would take A LONG TIME for your human eyeballs to find the biggest difference. Maybe you could use the min function by specifying axis=1 (horizontally):

In [58]:
# run this cell
(pollution_2006 - pollution_2001).min(axis=1)
Out[58]:
0   -178.059998
dtype: float64

This is a step in the right direction, but you get the value and not the column name... What are we to do? Tidy our data! Our data is not in tidy format, and so it's difficult to access the values for the variable pollutant because they are stuck as column headers. Let's use melt to tidy our data and make it look like this:

pollutant value
BEN -33.04
CO -6.91
... ...

To answer this question, fill in the ___ in the cell below.

Assign your answer to an object called pollution_diff and ensure it has the same column names as the table pictured above.
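
To see what melt does to a one-row "wide" data frame, here is a minimal sketch with made-up column names and values (A, B, and C are placeholders, not pollutants from the data set).

```python
import pandas as pd

# Hypothetical one-row "wide" frames (made-up values).
before = pd.DataFrame({"A": [10.0], "B": [5.0], "C": [8.0]})
after = pd.DataFrame({"A": [7.0], "B": [6.0], "C": [1.0]})

# Still one row, with one column per variable.
wide_diff = after - before

# melt moves the column headers into a column of their own.
tidy_diff = wide_diff.melt(var_name="pollutant", value_name="value")
print(tidy_diff)
```

After melting, each row holds one name and one value, so the names can be filtered and sorted like any other column.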

In [59]:
Student's answer(Top)
pollution_diff = pollution_2006 - pollution_2001
pollution_diff = pollution_diff.melt(var_name="pollutant", value_name="value")

pollution_diff
Out[59]:
pollutant value
0 BEN -33.039999
1 CO -6.910000
2 EBE -57.270002
3 MXY -38.250004
4 NMHC -1.450000
5 NO_2 15.800018
6 NOx -142.000000
7 OXY -71.610001
8 O_3 -41.100006
9 PM10 2.300018
10 PXY -76.049999
11 SO_2 -70.880005
12 TCH -1.930000
13 TOL -178.059998
In [60]:
Grade cell: cell-434094b036007273 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(pollution_diff is None)).encode("utf-8")+b"e6dc4").hexdigest() == "a48291d3d3d8f4bb8d8ecdc4bdbb96f3f0637be6", "type of pollution_diff is None is not bool. pollution_diff is None should be a bool"
assert sha1(str(pollution_diff is None).encode("utf-8")+b"e6dc4").hexdigest() == "45b6c8fa5f7363e5aed7056c51b17e93270b6302", "boolean value of pollution_diff is None is not correct"

assert sha1(str(type(pollution_diff.shape)).encode("utf-8")+b"e6dc5").hexdigest() == "1e996ccca590a01916e00623d4a0e098a8056c90", "type of pollution_diff.shape is not tuple. pollution_diff.shape should be a tuple"
assert sha1(str(len(pollution_diff.shape)).encode("utf-8")+b"e6dc5").hexdigest() == "664f8ba883064516d74c515c3ae64181dcd8972c", "length of pollution_diff.shape is not correct"
assert sha1(str(sorted(map(str, pollution_diff.shape))).encode("utf-8")+b"e6dc5").hexdigest() == "ac0ec142b7dbcd9228b5a02b4605a87275f50af5", "values of pollution_diff.shape are not correct"
assert sha1(str(pollution_diff.shape).encode("utf-8")+b"e6dc5").hexdigest() == "ccb7b0922316382da52f068c39fd1101ea541cf2", "order of elements of pollution_diff.shape is not correct"

assert sha1(str(type(pollution_diff.columns.values)).encode("utf-8")+b"e6dc6").hexdigest() == "9a0186d44f17405af9c057a4a44a83c9dcab216d", "type of pollution_diff.columns.values is not correct"
assert sha1(str(pollution_diff.columns.values).encode("utf-8")+b"e6dc6").hexdigest() == "e107e69889e920d1f8525444cc287cde6ada715c", "value of pollution_diff.columns.values is not correct"

assert sha1(str(type(sum(pollution_diff.value))).encode("utf-8")+b"e6dc7").hexdigest() == "25ab0d88c283434ea6b78f10549dc93d06f1a889", "type of sum(pollution_diff.value) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(pollution_diff.value), 2)).encode("utf-8")+b"e6dc7").hexdigest() == "fbc37cf355dc8ff85c0e3c2d3049a735c66d7bc5", "value of sum(pollution_diff.value) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!

Question 3.10
{points: 1}

Now that you have tidy data, you can use sort_values with the argument ascending=False to order the data in descending order. Each element of the value column corresponds to the change in a pollutant, so the largest decrease should be the most negative entry, i.e., the last row in the resulting dataframe. Therefore, we can take the sorted dataframe and chain it to tail (with the argument 1) to return only the last row of the data frame.

(The function tail is just like head, except that it returns the last rows of the dataframe instead of the first rows.)
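
The sort-then-tail idea can be sketched on a small made-up tidy data frame (hypothetical names and values). For comparison, the sketch also uses nsmallest, another pandas method that picks out the row with the smallest value.

```python
import pandas as pd

# Hypothetical tidy differences (made-up values).
tidy_diff = pd.DataFrame({
    "pollutant": ["A", "B", "C"],
    "value": [-3.0, 1.0, -7.0],
})

# A descending sort puts the most negative value last; tail(1) keeps that row.
largest_drop = tidy_diff.sort_values(by="value", ascending=False).tail(1)

# nsmallest(1, ...) selects the row with the smallest value directly.
same_row = tidy_diff.nsmallest(1, "value")

print(largest_drop)
print(same_row)
```

Both approaches keep the name next to its value, which is what the min(axis=1) attempt could not provide.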

To answer this question, fill in the ___ in the cell below.

Assign your answer to an object called max_pollution_diff.

In [61]:
Student's answer(Top)
max_pollution_diff = pollution_diff.sort_values(by="value", ascending=False).tail(1)
max_pollution_diff
Out[61]:
pollutant value
13 TOL -178.059998
In [62]:
Grade cell: cell-f9cfd97235c900d5 Score: 3.0 / 3.0 (Top)
from hashlib import sha1
assert sha1(str(type(max_pollution_diff is None)).encode("utf-8")+b"b0f20").hexdigest() == "3319329277b4c4b28b0da692d3db61099e3f315e", "type of max_pollution_diff is None is not bool. max_pollution_diff is None should be a bool"
assert sha1(str(max_pollution_diff is None).encode("utf-8")+b"b0f20").hexdigest() == "225841cc554f68dc91fe94a673d1ae1fbba9f2df", "boolean value of max_pollution_diff is None is not correct"

assert sha1(str(type(max_pollution_diff.shape)).encode("utf-8")+b"b0f21").hexdigest() == "00b91903947111b8d2c4547e39a7e623ce2222c7", "type of max_pollution_diff.shape is not tuple. max_pollution_diff.shape should be a tuple"
assert sha1(str(len(max_pollution_diff.shape)).encode("utf-8")+b"b0f21").hexdigest() == "a6a8476d3bbebc4d7c91984247fc773e232a4230", "length of max_pollution_diff.shape is not correct"
assert sha1(str(sorted(map(str, max_pollution_diff.shape))).encode("utf-8")+b"b0f21").hexdigest() == "de4229be5bda7d5e2f034d881048e3adb307c5f4", "values of max_pollution_diff.shape are not correct"
assert sha1(str(max_pollution_diff.shape).encode("utf-8")+b"b0f21").hexdigest() == "1b8d705605d870abb3a849a5d0abb8a11fbf2e2d", "order of elements of max_pollution_diff.shape is not correct"

assert sha1(str(type(max_pollution_diff.columns.values)).encode("utf-8")+b"b0f22").hexdigest() == "68b2531b118b88080b3ef9d7c5f8ba3e8cc6b566", "type of max_pollution_diff.columns.values is not correct"
assert sha1(str(max_pollution_diff.columns.values).encode("utf-8")+b"b0f22").hexdigest() == "a2a990d706c975f03b9ee7e4f818e05ebc3163f2", "value of max_pollution_diff.columns.values is not correct"

assert sha1(str(type(sum(max_pollution_diff.value))).encode("utf-8")+b"b0f23").hexdigest() == "fb731a4d61d086072658f6732a837e475fdf0c0d", "type of sum(max_pollution_diff.value) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(max_pollution_diff.value), 2)).encode("utf-8")+b"b0f23").hexdigest() == "59ec18405c4e8c2f7e58e3626713f427f1ee74be", "value of sum(max_pollution_diff.value) is not correct (rounded to 2 decimal places)"

print('Success!')
Success!

At the end of this data wrangling worksheet, we'll leave you with a couple of quotes to ponder:

“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
